Abstract

Today’s society has become more dependent on the integrity and protection of the
digital information used in daily transactions, resulting in an ever-increasing need
for information security. Additionally, the need for faster and more secure
cryptographic algorithms to provide this information security has become paramount.
Hardware implementations of cryptographic algorithms provide the necessary increase in
throughput, but at the cost of leaking critical information. Side Channel Analysis
(SCA) attacks allow an attacker to exploit the regular and predictable power
signatures leaked by cryptographic functions used in algorithms such as RSA. The RSA
public-key cryptographic algorithm is particularly vulnerable to SCA attacks in the
timing of the multiplication and squaring operations used in the modular
exponentiation process. Hardware obfuscation modifies a circuit’s hardware to
intentionally conceal its functionality from an attacker. This research focuses
on a means to counteract this vulnerability by creating a Critically Low Observable
Anti-Tamper Keeping Circuit (CLOAK) capable of continuously changing the way it
functions in both power consumption and timing. A Field-Programmable Gate Array
(FPGA) based encryption system testbed was developed to rapidly prototype and
conduct SCA of protected and unprotected cryptographic algorithms. This research
has determined that a polymorphic circuit design capable of varying circuit power
consumption and timing can protect a cryptographic device from Electromagnetic
Analysis (EMA) attacks. In essence, we are effectively CLOAKing the circuit
functions from an attacker.

DYNAMIC POLYMORPHIC RECONFIGURATION TO EFFECTIVELY
“CLOAK” A CIRCUIT’S FUNCTION

I.  Introduction 

The proliferation of computing technology has brought with it an increased
demand for information protection. Today, much of our lives is digitized and stored on
computer systems. These computer systems in one way or another handle the
majority of transactions conducted on a daily basis. Society has become more and more
dependent on the integrity and protection of the digital information used in these
transactions, causing an ever-increasing need for digital security. The United States
(U.S.) military, including the United States Air Force (USAF), relies on its
technological superiority to maintain our country’s military dominance. Operations
conducted on Department of Defense (DoD) information networks are of particular
importance to the United States Cyber Command (USCYBERCOM) and the USAF
mission.

1.1  Motivation 

The DoD considers any technology that makes a significant contribution to the
military potential of our country to be Military Critical Technology (MCT) [1].
Compromise of MCT, such as cryptographic algorithms, can have disastrous consequences
for the end user and the security of the nation. The Federal Information Security
Management Act (FISMA) of 2002 describes the importance of information security to
national defense and the economic security of the nation [5]. Cryptographic
algorithms were adopted by Federal agencies and private industry to provide information
security by ensuring integrity, confidentiality, and availability
[5]. In other words, information security provides confidence in the authenticity of
data, provides a way to keep information secret, and provides information in a timely
and reliable manner. Cryptographic algorithms provide differing levels of security
based on the time value and sensitivity requirements of the data being secured.

Threats to this information security include an adversary being able to hack into
our live video intelligence feeds [20] or information systems and retrieve valuable
information. Vital information about the 300 billion dollar Joint Strike Fighter (JSF)
program was stolen through such means [34]. To execute a successful attack, an
adversary only has to compromise the weakest link in a system to
gain access to critical technology. Side Channel Analysis (SCA) attacks on secure
communications devices can be conducted without end users knowing their emissions
are being exploited. These attacks exemplify the need for more robust secure
communication protocols.

Discovered by Paul Kocher et al. [39], power analysis attacks measure the power
dissipation of a circuit when a transistor is switched from 0 to 1 and, conversely,
from 1 to 0. Power analysis is conducted by placing a meter between power and ground
of the circuit to recover information regarding the cryptographic operations being
performed [39]. Timing analysis attacks exploit the slight differences in the amount of
time required to perform encryption operations [38], [40]. These slight differences
leak information about the secret key being used. Electromagnetic
Analysis (EMA) attacks are similar to power analysis except that they do not
require tampering with the device under attack. EMA exploits the TEMPEST¹ (i.e.,
electromagnetic radiation) information leaked by electronic devices, commonly called
compromising emanations [55].

¹ TEMPEST is a codename referring to the intelligence-bearing electromagnetic radiation
emanating from a circuit while in operation.

The National Institute of Standards and Technology (NIST) published Federal
Information Processing Standards (FIPS) Publication 201 pursuant to FISMA of 2002.
It specifies the architecture and technical requirements for a common identification
standard for Federal employees and contractors [6]. This FIPS publication specifies
that RSA [56] be used at a minimum for all Personal Identity Verification (PIV) cards.
PIV cards are used throughout the DoD to identify users for facility access and secure
communication.

Hardware protection systems seek to secure an electronic device that may be out
of our physical control. The consumer is assumed to be a trusted user and is able to
examine the technology at their leisure. In the case of the new iPhone 4G prototype
by Apple®, which was left in a bar in Redwood City, California, the trusted user lost
positive control of the device. Eventually the prototype found its way into the hands
of Gizmodo®. Apple was able to remotely disable and wipe the phone, which removed
the software vulnerability, but this did not stop the hardware from being subsequently
dissected and its features documented for release to the public. The reverse engineering
and release of proprietary phone features and capabilities happened months before
Apple’s big debut of the next generation iPhone. Even when the consumer is assumed to
be a trusted user, as in the case of a military-only secure communications device,
the potential still exists for the device to be lost or stolen and to subsequently
fall into the hands of a less than trustworthy user.

This research directly supports the USAF mission in the defense of the U.S. and
its global interests through the safeguarding of MCTs so that weapon system
capabilities stay out of adversarial hands. Attacks have been shown to exploit timing
vulnerabilities in modern cryptographic systems, enabling the recovery of the entire
secret key [38]. Once the attacker compromises the secret key, the security of the
entire system is compromised.


1.2  Problem  Statement 

Current public-key encryption systems do not provide adequate software or
hardware protection against side channel analysis attacks. Cryptographic
circuitry is particularly susceptible to side channel and power analysis. The
RSA [56] public-key cryptographic algorithm is particularly vulnerable to SCA attacks
in the timing of the multiplication and squaring operations used in the modular
exponentiation performed in the private-key operations. These timing attack
vulnerabilities were first discussed by Paul Kocher [38] in 1996. There has since been
ample documentation of the timing attack vulnerabilities affecting current public-key
encryption algorithms. This check and balance of public algorithms helps to ensure
their security by openly testing their limits and finding ways to increase their
ability to defeat these vulnerabilities. While defensive countermeasures exist in
modern designs, they do not address all methods used by malicious adversaries.

The key vulnerability to a side channel analysis attack in cryptographic algorithms
like RSA is the predictable time required to calculate the modular exponentiation in
the private-key operations. Timing attacks can be conducted by physically connecting
a meter between power and ground on the chip or by non-physical means that measure
the TEMPEST information (compromising emanations) leaked by the circuit. What
if we could build a dynamic circuit that continuously reconfigures itself in order to
successfully obfuscate its intent/purpose from an observer/adversary?
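
To illustrate why the multiplication and squaring operations leak key information,
the following minimal Python sketch implements a textbook left-to-right
square-and-multiply modular exponentiation and counts the operations it performs.
The function name `square_and_multiply` and the small operands are illustrative
assumptions; this is not the circuitry developed in this research.

```python
def square_and_multiply(base, exponent, modulus):
    """Textbook left-to-right binary modular exponentiation (illustrative only).

    Returns (result, squarings, multiplications) so the data-dependent
    operation count can be inspected.
    """
    result = 1
    squarings = 0
    multiplications = 0
    for bit in bin(exponent)[2:]:      # scan exponent bits, most significant first
        result = (result * result) % modulus
        squarings += 1
        if bit == "1":                 # the extra multiply happens only for 1 bits
            result = (result * base) % modulus
            multiplications += 1
    return result, squarings, multiplications


# Two hypothetical 4-bit exponents: same length, different number of 1 bits.
print(square_and_multiply(7, 0b1011, 33))
print(square_and_multiply(7, 0b1000, 33))
```

Both exponents require four squarings, but the number of multiplications follows the
number of 1 bits in the exponent. An observer who can distinguish the two operations
in a timing or EM trace can therefore read off bits of the private exponent.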

1.3  Research  Objectives  and  Contributions 

The desired outcome of this research is the creation and analysis of a Critically Low
Observable Anti-Tamper Keeping Circuit (CLOAK) that is capable of obfuscating the
circuit function from side channel analysis by an observer/adversary. The CLOAK
countermeasure is intended to defend against SCA attacks on a cryptographic
algorithm in order to protect the secret key. This research
is broken down into three parts with the ultimate goal of protecting the
secret key.

1.3.1  Polymorphic  Circuit  Design. 

The key to providing a secure cryptographic system is the protection of the
secret key from the adversary. This research leverages and extends previous work by
Roy Porter [53], [54] and Camdon Cady [22] with polymorphic circuits capable of
changing the way they function based on an input key. A polymorphic circuit design
is developed and implemented on a Field-Programmable Gate Array (FPGA). This
polymorphic circuit design is capable of continuously changing the way it functions
to obfuscate its side channel signature in both power consumption and timing. This
effectively creates the CLOAK countermeasure for hardware obfuscation. The
Electromagnetic (EM) side channel of the circuit is then characterized to ensure proper
functionality by implementing it into the modular exponentiation circuitry of a
hardware implementation of RSA.

1.3.2  Implement  RSA  with  Reconfiguration. 

An implementation of RSA is designed and synthesized on an FPGA using a
structural approach, which is a design based on the components and their
interconnects. This approach gives the developer more control over the implementation
of the modular exponentiation portion of the cryptographic algorithm. The first
implementation uses an unprotected implementation of the modular exponentiation
circuitry for baseline analysis. The second implementation uses the polymorphic circuit
design for the modular exponentiation circuitry.


1.3.3  Analyze  Side  Channel  Signatures. 

The EM side channel of the original implementation of RSA is characterized to
set an initial baseline for subsequent analysis of the side channel. The baseline is
compared to the modified RSA implementation to characterize the extent to which the
circuit is able to resist timing attacks and protect the secret key. It is expected
that the combination of elements in this approach culminates in a proof-of-concept
polymorphic circuit design that enhances the system’s protective countermeasures by
obfuscating operations.

1.4  Thesis  Organization 

The work presented in this thesis is organized into five chapters. Chapter 2
follows the introduction and provides background information on related research, to
include reconfigurable computing, cryptographic algorithms, hardware obfuscation,
circuit vulnerabilities, and three types of side channel analysis. Chapter 3 details
the methodology of my approach to the implementation and evaluation of the stated
objectives. Chapter 4 details the results of the experiments and any adjustments
required to facilitate the stated objectives. Finally, Chapter 5 provides a summary
of the relevance of the work accomplished and the major contributions of this thesis.
Additionally, it contains recommendations for future directions for extending this
research.

II. Literature Review


The following chapter orients the reader to the diverse range of topics
used in the creation and protection of MCTs through hardware obfuscation. This
chapter is structured as follows. Section 2.1 covers the strengths and weaknesses of
reconfigurable computing, with particular emphasis on FPGAs and the motivation
behind their use in this research. Section 2.2 covers some common
cryptographic algorithms and key generation techniques for use in information
security. Section 2.3 describes the concept of obfuscating hardware operations. Section
2.4 details four classes of circuit vulnerabilities. Finally, the power, timing, and EM
analysis categories of SCA attacks are developed in Sections 2.5, 2.6, and 2.7,
respectively, to include current attacks and countermeasures for each and how they
fit in with this research.

2.1  Reconfigurable  Computing 

Reconfigurable computing refers to the ability of a system to reprogram itself or
redefine the way the circuit functions at run-time. This idea of reconfigurable
computing has been in existence for almost half a century [32]. An Application-Specific
Integrated Circuit (ASIC) is an integrated circuit that is designed for a specific
purpose, and that design is set at the manufacturer. Any small change in an ASIC
design would force a redesign and refabrication of the entire chip, which is an
expensive and time-consuming process [29]. FPGAs, on the other hand, are reconfigurable
systems capable of having their operations defined after the chip is manufactured,
and they are the primary platform for reconfigurable computing.

The primary advantage of using FPGAs over ASICs is the reconfigurable
nature of the circuitry. The ability to define or redefine a circuit’s purpose at the
user level or in the field can have great advantages, especially if the chip is
hardwired into a deployed system. There may be a need to modify a system’s purpose or
the way a system operates based on the needs of the user of the end system. The
capability of obfuscating a circuit’s operations from an adversary by changing how a
system or component does its routine operations can have unique applications in
information security. This advantage can come at the cost of performance, since FPGAs
still lag behind ASICs in clock speed and throughput [29].

2.1.1  FPGAs. 

An FPGA is a two-dimensional array of reprogrammable gates and other assorted
functional units combined together to carry out different logical functions. FPGAs
can be reprogrammed by the customer multiple times by downloading a bitstream
through the use of proprietary software and a Joint Test Action Group (JTAG) interface.
During this configuration process the user defines the hardware circuit description
using the Very-High-Speed Integrated Circuit (VHSIC) Hardware Description
Language (VHDL). The proprietary Xilinx® or Altera® software then transforms the
user’s VHDL circuit description into a bitstream to be downloaded and
programmed on the chip. Modern FPGAs use a combination of soft-core processors,
embedded microcontrollers, and Configurable Logic Blocks (CLBs) to provide a flexible
platform for rapid system development.

While each manufacturer has a different nomenclature for the architectural
elements of their FPGAs, the basic design is fairly standard. The terms used in this
research are consistent with the Xilinx Virtex®-5 [15] and Virtex-6 [16] FPGAs. Each
FPGA contains top-level logic elements/structures called CLBs. Each CLB contains
two slices which contain equivalent circuitry such as function generators (i.e.,
Look-Up Tables (LUTs)), storage elements, arithmetic logic gates, multiplexers, and
fast carry look-ahead chains. These CLBs are interconnected by a reconfigurable switch
matrix for routing. Ultimately, all FPGA technology is based on LUT structures to give
the user programmable logic function generators. LUTs are multiple-input,
single-output structures that give a desired output based on specified input
combinations. The Virtex-5 FPGAs are built using 65 nm process technology and are
implemented using 6-input LUT technology. The Virtex-6 FPGAs are also implemented
using 6-input LUT technology, but are built on a smaller 40 nm process technology to
decrease power consumption.

Data encryption is a highly repetitive process where FPGAs and reconfigurable
computing have been shown to increase the performance of systems by leveraging
hardware functions to execute these repetitive processes [29]. FPGAs enable rapid
prototyping while exploiting the performance gains of using hardware functions.
Typically, to reconfigure an FPGA the user must take the system offline to download a
new bitstream design.

2.1.2  Run-time  and  Partial  Reconfiguration. 

Reconfiguration in an FPGA can be executed in different ways. The simplest
and most common way is to take the FPGA offline while it is loaded with a new
bitstream. The ability to reconfigure a circuit leads to an area of particular interest:
the ability to reconfigure the system while it is still in operation. Altera defines
run-time reconfiguration as the ability to modify or change the functional
configuration of the device during operation, through hardware or software changes [7].
The concept of being able to change portions of a system’s circuitry at run-time
instead of taking the system offline to program the bitstream can have unique
implications in information security.

Partial Reconfiguration (PR) for an FPGA is a Xilinx design flow that
creates reconfiguration regions or zones in an FPGA device, so that one region can be
reconfigured while the remainder of the FPGA continues to operate normally [7]. The
ability to change only a portion of a design while leaving other portions untouched
can be a powerful tool. This feature gives the designer the ability to perform repetitive
processes in hardware to increase performance, while maintaining the flexibility of a
software solution [29]. Stone et al. introduced a design where individual LUTs could
be reconfigured while the FPGA was still in operation [61].

Stone’s design has some promising applications, but comes with inherent
limitations. The reconfiguration circuit is capable of operating at a higher frequency
than the rest of the FPGA, which minimizes downtime, but as the design gets
larger the cost starts outweighing the benefit. This approach allows the design to be
dynamically altered, but the routing cannot be changed. Additionally, each design
has pre-compiled bitstreams that must be stored on the system. A solution that is
somewhere in the middle would be optimal.

2.1.3  Polymorphic  Reconfiguration. 

Polymorphic gates, originally proposed by Stoica et al. [60], use non-traditional
control variables such as temperature or voltage to change the function of a gate.
These polymorphic gates were then combined to create polymorphic circuits
[58]. These circuits are not very practical for rapid design and employment since the
technology relies on a specific fabrication of the underlying gate structures. Also, the
external conditions required to create these polymorphic changes in the circuits were
not always easily controlled and could pose a serious threat to the standard operation
of the circuit. Critical applications would require a more stable polymorphic design
that is not as susceptible to random environmental impacts. In addition, the use
of external conditions for directing the reconfiguration process also provides another
avenue of attack for adversaries.


Polymorphic reconfiguration, as defined by the Air Force Institute of Technology
(AFIT), is a combination of polymorphic gates or decision circuits to create a
dynamically changing circuit design that is capable of changing the way it functions
based on an input key. A basic example of a polymorphic gate is created by replacing
any 2-input gate with a 4-to-1 Multiplexer (MUX), as defined by [22]. This is
accomplished by connecting the gate inputs to the MUX select lines and driving the
MUX data inputs with the truth table of the gate function to be created. These data
inputs can be driven to a specific level manually or driven by external key generation
circuitry, causing the polymorphic gate to function like a combination lock that
unlocks the gate’s function. An example of converting an AND gate into a polymorphic
AND gate is shown in Figure 1.


Figure 1. Polymorphic Gate Conversion: (a) 2-input AND gate; (b) 2-input polymorphic
gate; (c) polymorphic gate configured as an AND
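
The MUX-based polymorphic gate described above can be modeled in a few lines of
Python. This is a behavioral sketch only: the function name `polymorphic_gate`, the
ordering of the data inputs, and the example keys are assumptions made for
illustration and are not taken from [22].

```python
def polymorphic_gate(a, b, key):
    """Behavioral model of a 2-input gate built from a 4-to-1 MUX.

    a, b : gate inputs, wired to the MUX select lines (a is the high bit).
    key  : the four MUX data inputs, i.e. the truth table for inputs
           00, 01, 10, 11 (this ordering is an assumption of the sketch).
    """
    return key[(a << 1) | b]


# Hypothetical keys: each one "unlocks" a different gate function.
AND_KEY = (0, 0, 0, 1)
OR_KEY = (0, 1, 1, 1)
XOR_KEY = (0, 1, 1, 0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, polymorphic_gate(a, b, AND_KEY))
```

Changing only the key changes the logical function, while the structure seen by an
observer stays the same.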

Other gates like polymorphic switches may also be used to create polymorphic
circuits. A polymorphic switch is a 2-input, 2-output switch that uses a single control
line to drive the input/output selection for the gate. An example of the polymorphic
switch and its operation is shown in Figure 2. All of these techniques can be used to
enhance information security, and at the heart of information security are
the cryptographic algorithms that protect the information.

Figure 2. Polymorphic Switch: (a) switch with 0 applied at control line; (b) switch
with 1 applied at control line
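
A polymorphic switch can be sketched the same way. The figure itself is not
reproduced here, so the behavior below (a control value of 0 passes the inputs
straight through and a control value of 1 crosses them) is an assumption of this
sketch, as is the function name `polymorphic_switch`.

```python
def polymorphic_switch(in0, in1, control):
    """Behavioral model of a 2-input, 2-output switch with one control line.

    Assumed behavior: control = 0 passes the inputs straight through,
    control = 1 swaps them.
    """
    return (in1, in0) if control else (in0, in1)


print(polymorphic_switch("A", "B", 0))
print(polymorphic_switch("A", "B", 1))
```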

2.2 Cryptographic Algorithms

Cryptographic algorithms today are primarily used to protect the transmission
and storage of information in digital form. All cryptographic algorithms in one way
or another attempt to take the plaintext¹ and scramble it to create ciphertext² in
a process called encryption or enciphering. The reverse is the act of recovering the
plaintext from the ciphertext, called decryption or deciphering.

Information security is essential not only to national defense, but also to the
economic security of the nation. Many of the transactions conducted online require the
use of public-key encryption and digital signature schemes. Federal employees and
contractors throughout the DoD use PIV cards as a means of identification for
facility access and for secure communication. These cryptographic systems
use public-key cryptographic algorithms with a set of encryption transformations and
a set of decryption transformations, as in RSA, to ensure information security. Every
type of cryptographic system requires some sort of private or secret key to enable the
security of the system. In a symmetric-key cryptographic algorithm the encryption
and decryption keys are the same and are known by both sender and receiver, but in
a public-key algorithm the encryption and decryption keys are different. The public
key is used for encryption while a secret key is used for decryption. The next few
sections discuss some methods of generating and protecting the key, followed
by a popular symmetric-key algorithm (AES), and lastly one of the most commonly
used public-key cryptographic algorithms (RSA).

¹ The plaintext is the original message to be sent or received.

² The ciphertext is the scrambled or hidden message that may be transmitted over
non-secure channels.

2.2.1  Key  Generation  and  Protection. 

All modern cryptographic algorithms rely on the use and protection of a secret key
regardless of the type of cryptosystem. Encryption/decryption systems typically fall
into two different categories: symmetric-key and public-key. Both require a key that
must be kept secret: in a symmetric-key cryptosystem it is the secret key, and in a
public-key cryptosystem it is the private key. If this key is compromised, the security
of the entire system is compromised, regardless of the particular algorithm used [57].
An adversary who can analyze a system and retrieve the secret key therefore
eliminates the security of the system.

One of the advantages of using a public-key cryptosystem like RSA over a
symmetric-key cryptosystem is that a secure channel is not needed to distribute keys
[57]. It is much easier to use a public-key cryptosystem since the public key can be
freely distributed and does not need to be kept secret. In a symmetric-key cryptosystem
the key must not only be generated, but also be distributed to all parties
before secure communication can commence. Key generation requires that the key be of
sufficient size and be “random” in the sense that the probability of any particular
value being selected must be sufficiently small [45].

This section discusses two general techniques for key generation: random and
pseudorandom. “In the classical (Kolmogorov) sense, a string of bits is random if
it cannot be described by a shorter string than itself” [42]. A random bit generator
can be used to generate a uniformly distributed random number that would force
an adversary to guess among $2^n$ possible keys, where $n$ is the number of bits in the
key. A truly random key is difficult to produce, but would provide the best protection
in a cryptosystem. However, generation of true random bits is an inefficient procedure
in most practical environments [45].

Pseudorandom Number Generators (PRNGs) are typically used in cryptographic
systems in the key generation process. If an appropriate PRNG is selected, it can
provide adequate protection in the creation of pseudorandom seeds. PRNGs often
use a Linear Feedback Shift Register (LFSR), shown in Figure 3, to create the
pseudorandom bit stream. An LFSR is basically a shift register that has its input
connected to some combination of bits already contained within the register. A simple
example of an LFSR would be to have some of the register bits XORed together to form
the input bit into the shift register. An LFSR can be described by an associated
feedback polynomial that shows which coefficients are tapped to determine the input
bit. Using the recommended tap locations in [3] to achieve a maximum-length LFSR
counter, a 4-bit LFSR produces $2^n - 1 = 2^4 - 1 = 15$ different values before
repeating. This gives the feedback polynomial shown in Equation 1.

$$x^4 + x^3 + 1 \qquad (1)$$
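
The maximum-length property of the 4-bit LFSR can be checked with a short Python
sketch. It assumes a Fibonacci-style register in which the taps at stages 4 and 3
(the terms of Equation 1) are XORed to form the new input bit; the function name
`lfsr4_states` and the seed value are illustrative choices, not taken from [3].

```python
def lfsr4_states(seed=0b0001):
    """Return the states visited by a 4-bit LFSR before it returns to the seed.

    Feedback polynomial: x^4 + x^3 + 1 (taps at stages 4 and 3).
    Stage i is stored in bit i-1 of the integer state.
    """
    if not 0 < seed < 16:
        raise ValueError("seed must be a non-zero 4-bit value")
    states = []
    state = seed
    while True:
        states.append(state)
        feedback = ((state >> 3) ^ (state >> 2)) & 1   # stage 4 XOR stage 3
        state = ((state << 1) | feedback) & 0xF        # shift in the new bit
        if state == seed:
            return states


sequence = lfsr4_states()
print(len(sequence))   # number of distinct states before repeating
print(sequence)
```

The all-zero state is excluded because it would feed back a 0 forever, which is why
the maximum length is $2^n - 1$ rather than $2^n$.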

2.2.2 AES.

The Advanced Encryption Standard (AES) is a symmetric-key cryptographic
algorithm. A symmetric-key cryptographic algorithm uses the same key to encrypt and
decrypt a message. This concept is shown in Figure 4.


Figure 4. AES - Symmetric-Key Cryptographic Algorithm (Plaintext, Encryption,
Decryption, Plaintext)


AES is currently approved by NIST for use by Federal agencies to provide
cryptographic protection for sensitive (unclassified) information. AES is a 128-bit
block cipher that is capable of using 128-, 192-, or 256-bit keys to encrypt and decrypt
[4]. The AES algorithm bases its security on the ability to diffuse and
confuse the message and key together to generate the ciphertext. If the circuit
components that implement its functions can be identified, the key can be recovered.
The five functions used in AES are:

• KeyExpansion - This routine is used to generate a series of Round Keys from
the Cipher Key. It is the first function to be performed when an encryption or
decryption operation begins. It takes as input the 128-, 192-, or 256-bit AES
key and generates the round keys for the 10, 12, or 14 rounds, respectively, plus
one round key for the initial AddRoundKey.

• SubBytes - This is a transformation function that takes the 4 x 4 State³ array of
bytes and uses a non-linear substitution table (S-Box) that operates on each of
the State bytes independently.

³ The State is a 4 x 4 array containing the intermediate state of the cipher.

• ShiftRows - This is a transformation function that cyclically shifts
the last three rows of the 4 x 4 State array of bytes. The second row is shifted
left by one byte, the third row is shifted left by two, and the fourth row is shifted
left by three.

•  MixColumns  -  This  is  a  transformation  function  that  takes  all  the  columns  of 
a  4  x  4  State  array  of  bytes  and  mixes  their  data  by  multiplying  each  column 
of  the  array  by  a  fixed  polynomial. 

•  AddRoundKey  -  This  is  a  transformation  function  that  adds  a  Round  Key  to 
the  4x4  State  array  of  bytes  using  an  XOR  operation.  The  length  of  the  Round 
Key  equals  the  size  of  the  State. 
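
Two of the transformations above are simple enough to sketch directly. The Python fragment below is a minimal illustration, not a full AES implementation; it assumes the State is stored as a list of four rows of four byte values, and the function names `shift_rows` and `add_round_key` are chosen for this example only.

```python
def shift_rows(state):
    """Cyclically shift row r of a 4 x 4 State left by r bytes (row 0 is unchanged)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def add_round_key(state, round_key):
    """XOR each State byte with the corresponding Round Key byte."""
    return [[s ^ k for s, k in zip(state_row, key_row)]
            for state_row, key_row in zip(state, round_key)]


# Hypothetical example values, used only to show the data movement.
state = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
round_key = [[0xA5] * 4 for _ in range(4)]
print(shift_rows(state))
print(add_round_key(state, round_key))
```

Because AddRoundKey is an XOR, applying the same Round Key twice returns the original State, which is why the same function serves both encryption and decryption.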

2.2.3  RSA. 

Rivest, Shamir, and Adleman (RSA), named after its inventors, is a public-key
cryptographic algorithm that uses two keys and a modulus function to encrypt and
decrypt messages. RSA is the most widely used public-key cryptosystem [45] for
information security, and it can also be used for digital signatures. A pictorial
representation of the RSA public-key cryptographic algorithm is shown in Figure 5.

Figure 5. RSA - Public-Key Cryptographic Algorithm (diagram labels: Public Key (e), Private Key (d))

In RSA, two keys, e and d, called the encryption exponent and the decryption
exponent, are generated for use with the modulus n. One key is kept private (secret) while
the other is made public. RSA’s security is based on the intractability of factoring very large
integers, which is closely related to the discrete logarithm problem. The key generation
algorithm is illustrated in pseudocode in Algorithm 1. Key generation is
important for RSA because the security of the algorithm is based completely on the
“hardness” of factoring a product of two large prime numbers.


Algorithm  1  Key  Generation  for  RSA  Public-Key  Cryptosystem [45]  [57] 
SUMMARY:  each  entity  creates  an  RSA  public  key  and  a  corresponding  private  key. 
Each  entity  A  should  do  the  following: 

1.  Generate  two  large  random  (and  distinct)  primes  p  and  q ,  each  roughly  the  same 
size. 

2. Compute $n = pq$ and $\phi(n) = (p - 1)(q - 1)$.

3. Select a random integer $e$, $1 < e < \phi(n)$, such that $\gcd(e, \phi(n)) = 1$ (i.e., $e$
and $\phi(n)$ share no divisors other than 1).

4. Use the extended Euclidean algorithm to compute the unique integer $d$, $1 < d < \phi(n)$,
such that $ed \equiv 1 \pmod{\phi(n)}$.

5.  A’s  public  key  is  (n,  e);  A’s  private  key  is  d. 
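
The steps of Algorithm 1 can be sketched in Python as follows. This is a toy illustration only: the primes are supplied by the caller rather than generated at random, the values 61, 53, and 17 are textbook-sized numbers far too small for real use, and the names `extended_gcd` and `rsa_keygen` are assumptions made for this example.

```python
from math import gcd


def extended_gcd(a, b):
    """Return (g, x) where g = gcd(a, b) and a*x is congruent to g modulo b."""
    old_r, r = a, b
    old_s, s = 1, 0
    while r:
        quotient = old_r // r
        old_r, r = r, old_r - quotient * r
        old_s, s = s, old_s - quotient * s
    return old_r, old_s


def rsa_keygen(p, q, e):
    """Steps 2-5 of Algorithm 1 for caller-supplied primes p, q and exponent e."""
    n = p * q                                  # step 2: modulus
    phi = (p - 1) * (q - 1)                    # step 2: phi(n)
    if not 1 < e < phi or gcd(e, phi) != 1:    # step 3: e must be coprime to phi(n)
        raise ValueError("e must satisfy 1 < e < phi(n) and gcd(e, phi(n)) = 1")
    _, x = extended_gcd(e, phi)                # step 4: e*x = 1 (mod phi(n))
    d = x % phi
    return (n, e), d                           # step 5: public key, private key


public_key, private_key = rsa_keygen(61, 53, 17)
print(public_key, private_key)
```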


A brief description of the RSA public-key cryptographic algorithm is presented in
Algorithm 2. RSA generates the ciphertext with a single mathematical function,
modular exponentiation.


Algorithm  2  The  RSA  Algorithm  [45] 


SUMMARY:  B  encrypts  a  message  m  for  A,  which  A  decrypts. 


1. Encryption. B should do the following:

(a) Obtain A’s authentic public key $(n, e)$.

(b) Represent the message as an integer $m$ in the interval $[0, n - 1]$.

(c) Compute $c = m^e \bmod n$ (e.g., using Algorithm 3).

(d) Send the ciphertext $c$ to A.

2. Decryption. To recover the plaintext $m$ from $c$, A should do the following:

(a) Use the private key $d$ to recover $m = c^d \bmod n$.
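
A minimal Python sketch of Algorithm 2 is shown below, using Python's built-in three-argument `pow` for the modular exponentiation. The key values (n = 3233, e = 17, d = 2753) are a toy example assumed for illustration and are far too small to be secure; the function names are likewise assumptions.

```python
def rsa_encrypt(m, public_key):
    """Step 1: compute c = m^e mod n for a message 0 <= m <= n - 1."""
    n, e = public_key
    if not 0 <= m < n:
        raise ValueError("message must lie in the interval [0, n - 1]")
    return pow(m, e, n)


def rsa_decrypt(c, d, n):
    """Step 2: recover m = c^d mod n with the private key d."""
    return pow(c, d, n)


# Toy key pair (assumed for illustration): p = 61, q = 53.
N, E, D = 3233, 17, 2753
ciphertext = rsa_encrypt(65, (N, E))
print(ciphertext, rsa_decrypt(ciphertext, D, N))
```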


The RSA discussed in this research uses an implementation based on modular
exponentiation. Modular exponentiation is implemented with the fairly simple, repetitive
Square-and-Multiply Algorithm; the Right-to-Left
design is described in Algorithm 3. This repetition makes the operation easy to
implement, but it also makes RSA particularly vulnerable to SCA attacks.


Algorithm  3  Square-and-Multiply  Algorithm  for  Modular  Exponentiation 

SUMMARY: Algorithm for calculating the plaintext message from the ciphertext ($m = c^d \bmod n$) in RSA,
where $d_i$ are the exponent bits ($0 \le i \le t$) of the binary representation $d = \sum_{i=0}^{t} d_i 2^i$.

Set M ← 1 and C ← c
if d = 0 then
    Return (M)
end if
for i = 0 up to t do
    if d_i = 1 then
        M = M * C mod n (Multiply Operation)
    end if
    C = C * C mod n (Square Operation)
end for
Return (M)
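
Algorithm 3 translates almost line for line into Python. The sketch below is a software illustration of the right-to-left method, assuming a modulus n > 1; it is not the hardware design used in this research, and the function name is chosen for this example.

```python
def square_and_multiply(c, d, n):
    """Right-to-left square-and-multiply: return c^d mod n (assumes n > 1)."""
    M, C = 1, c % n
    while d > 0:
        if d & 1:            # exponent bit d_i = 1
            M = (M * C) % n  # Multiply Operation
        C = (C * C) % n      # Square Operation
        d >>= 1              # move to the next exponent bit
    return M


print(square_and_multiply(2790, 2753, 3233))
```

Note that the multiply is executed only when the current exponent bit is 1, so the sequence of operations depends directly on the private exponent.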


The  key  component  in  modular  exponentiation  is  the  modular  multiplication  unit. 
Figure  6  shows  a  graphical  representation  of  a  Montgomery  Modular  Multiplication 
unit  used  in  this  research.  This  multiplier  uses  a  series  of  shift  and  add  operations 
followed  by  modulo  reductions.  The  multiplicand  is  bitwise  shifted  to  the  left  and 
added  to  the  final  product  based  on  the  Least  Significant  Bit  (LSB)  of  the  multiplier 
as  it  is  bitwise  shifted  to  the  right.  After  each  shift  and  add  operation  the  modulus 
is  subtracted  from  the  sum  either  0,  1,  or  2  times  to  get  the  final  product. 
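
The shift-and-add behaviour described above can be sketched in software as follows. This is a simplified interleaved modular multiplication, assumed here for illustration; it does not perform Montgomery-domain reduction and is not a model of the unit in Figure 6. Because both the shifted multiplicand and the running sum are reduced separately, each step in this sketch needs at most one subtraction of the modulus.

```python
def shift_add_modmul(a, b, n):
    """Return (a * b) mod n using only shifts, additions, and subtractions."""
    a %= n
    b %= n
    product = 0
    while b:
        if b & 1:              # LSB of the multiplier selects an addition
            product += a
            if product >= n:   # reduce the running sum
                product -= n
        a <<= 1                # shift the multiplicand left by one bit
        if a >= n:             # reduce the shifted multiplicand
            a -= n
        b >>= 1                # shift the multiplier right by one bit
    return product


print(shift_add_modmul(65, 65, 3233))
```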

The basis for any RSA implementation is modular exponentiation. Given a modular
multiplier instantiation, the modular exponentiation operation becomes a simple
repetitive process of squares and multiplies. However, this simplicity also makes
RSA software and hardware very vulnerable to attacks, particularly to SCA attacks.

2.3 Hardware Obfuscation

Obfuscation is typically thought of as a technique that transforms a design into a
functionally equivalent design that resists reverse engineering [28], [66]. Hardware
obfuscation is the modification of circuit hardware to intentionally conceal its
functionality from an observer or adversary. Hardware obfuscation is conducted in an
attempt to protect Intellectual Property (IP) from being stolen or, in the case of
cryptographic systems, to protect data being passed between systems [25]. The
hardware obfuscation used as a part of this research falls into two main categories:
security through obscurity and authentication based obfuscation.

2.3.1  Security  Through  Obscurity. 

The idea behind security through obscurity involves changing the white-box
characteristics or underlying gate structure and signals of the system. Hardware
obfuscation that creates security through obscurity can be executed several different ways.
The first technique involves netlist obfuscation [24] while still maintaining the original
circuit semantics. This does not defend the circuit from being stolen and used as a
“black-box” system, but it makes it more difficult for an attacker to determine and
modify the circuit functions. The process of netlist obfuscation changes depending
upon the technique, but typically involves some sort of smart selection and
replacement of gates to create new paths within the design.

The second technique for obfuscation involves modifying the circuit hardware by
adding extra circuitry at the design level to intentionally obfuscate white-box circuit
functions. This can be accomplished dynamically by reprogramming without halting
operations [61]. Alternatively, a circuit may be fabricated with gates that change the
logical flow of the Finite State Machine (FSM) and create multiple paths for circuit
operation. This allows the circuit to dynamically change the way it operates at
run time. The same logic could also be used to enable or disable circuit operations based on a key.

2.3.2  Authentication  Based  Obfuscation. 

Authentication based obfuscation can be accomplished by inserting an FSM into the
control logic of the function, causing the system to function properly only if it is provided
the proper key or authentication protocols. The authentication protocols require some
sort of cryptographic algorithm and digital signatures embedded on the system. The
system will only operate properly if a proper challenge and response is given. These
digital signatures are known only to the creator of the IP core. Each system can
potentially have a unique authentication signature, which would require the storage
and control of these signatures.

The idea of having a hardware specific identifier that would not have to be centrally
controlled led to the development of hardware signatures (digital fingerprints) that
are tied to a specific piece of hardware. If a circuit has this type of functionality
and the design is somehow stolen/cloned and placed on similar hardware, it will
not operate properly. These hardware signatures stemmed from research in Digital
Fingerprinting [30], [41], [52] and Physically Uncloneable Functions (PUF) [33], [62].
Both are similar in nature and involve the use of circuitry that utilizes subtle defects
in the manufacturing process which cause two seemingly identical circuits to operate
measurably differently. These subtle differences can cause slightly different delays in
the circuitry, causing glitches that can be counted to create a hardware based key.

2.4  Circuit  Vulnerabilities 

MCTs  and  proprietary  information  systems  need  to  be  secured  from  potential 
attack.  There  are  several  classes  of  attacks  that  can  be  used  against  both  ASICs 
and  FPGAs  alike.  Four  classes  of  attacks  are  discussed  here:  Reverse  Engineering, 
Invasive  Techniques,  Semi-Invasive  Techniques,  and  Side  Channel  Analysis  Attacks. 
The following vulnerabilities stem from the common circuit structures used to create
FPGAs, which are also common to ASICs.

2.4.1  Reverse  Engineering. 

Reverse engineering is the process of analyzing a circuit to identify the
individual components within a system and how they interact with each other in order
to identify the circuit’s intent. The ultimate goal of reverse engineering is to visualize
the relationships between system components [27]. There are two classifications of
reverse engineering: black-box analysis and white-box analysis. The two methods
correspond to the amount of information that is known about the circuit.

In black-box analysis little is known about the circuit structure. This method of
analysis relies on the input/output relationship (i.e., the truth table) of the circuit to
identify it. Inputs are applied to a circuit and the outputs are recorded to enumerate
the truth table and identify the circuit function. For smaller circuits, the truth table
of a component can be compared to a library directly. For larger circuits, it is not
feasible even to enumerate the truth table, which grows as $2^n$ for $n$ inputs, let alone
compare it to a library [54].
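
For a small component, the black-box procedure amounts to enumerating every input combination and matching the result against known functions. The Python sketch below assumes a hypothetical four-entry gate library and a two-input component; the names `LIBRARY`, `truth_table`, and `identify` are introduced only for this illustration.

```python
from itertools import product

# Hypothetical library of known two-input components.
LIBRARY = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
}


def truth_table(fn, n_inputs):
    """Apply every input combination and record the outputs (2**n_inputs rows)."""
    return tuple(fn(*bits) for bits in product((0, 1), repeat=n_inputs))


def identify(black_box, n_inputs=2):
    """Return the names of library components whose truth table matches."""
    observed = truth_table(black_box, n_inputs)
    return [name for name, ref in LIBRARY.items()
            if truth_table(ref, n_inputs) == observed]


unknown = lambda a, b: 0 if (a and b) else 1   # the component under analysis
print(identify(unknown))
```

The table has $2^n$ rows, so a component with 64 inputs would already require $2^{64}$ evaluations, which illustrates why this approach does not scale to larger circuits.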

White-box analysis assumes that some information about the circuit structure is
known, either through a circuit netlist or imagery. Given a gate-level description of
a circuit, it is possible to identify components within a circuit design [64]. Modern
synthesis tools use libraries of optimized components to reduce the need to develop
entire systems from scratch. These common components carry out specific
operations so that designers can focus on other areas of development. White-box analysis
exploits the use of these common components and hierarchical design methodologies.

2.4.2  Invasive  Techniques. 

While  monitoring  a  circuit  to  gain  knowledge  about  it  has  been  somewhat  passive 
in  nature,  sometimes  other  techniques  may  be  required  to  discover  more  about  the 
device.  If  access  to  the  target  device  is  unrestricted  then  the  device  may  be  physically 
tampered  with.  Invasive  techniques  require  de-packaging  components  by  removing  the 
protective  packaging  around  the  circuit  under  investigation.  These  attacks  require 
specialized tools to de-package and map the chip’s components. Even though these
techniques have not been demonstrated against FPGA technology directly, the SRAM
technology that FPGAs are built on has been shown to be susceptible to these types
of attacks [65].

2.4.3  Semi-invasive  Techniques. 

Semi-invasive techniques are similar to invasive techniques in that they also require
the outer packaging to be removed, but they require less effort and expense to execute.
Semi-invasive techniques fall largely in the area of fault injection. Fault injection
using low cost items like a camera flash or laser pointer has been shown to temporarily
change the state of individual Complementary Metal-Oxide Semiconductor (CMOS)
transistors [59]. This method exploits the fact that the CMOS transistor is susceptible
to ionizing radiation. An attacker can use semi-invasive techniques to flip a bit in a
cryptographic circuit, allowing the system to be cracked [19].

2.4.4  Side  Channel  Analysis  Attacks. 

SCA attacks are used to non-invasively tamper with an Integrated Circuit (IC)
and attempt to retrieve the secret key by analyzing the electrical emissions leaked
from the circuit’s switching activity during normal operation. These attacks typically
exploit  information  leaked  from  three  different  side  channels:  power  consumption 
[40],  timing  delays  [38],  and  EM  emission  [18].  The  EM  side  channel  remains  the 
most  viable  avenue  of  attacking  cryptographic  devices  when  the  power  side  channel 
is  unavailable  [18].  For  the  remainder  of  this  research  SCA  attacks  will  be  classified 
in  the  three  main  categories  for  development  in  Sections  2.5,  2.6,  and  2.7: 

•  Power  Analysis 

•  Timing  Analysis 

•  Electromagnetic  Analysis 

2.5  Power  Analysis 

Power analysis attacks were introduced by Kocher [39]. They measure the power
dissipated by a circuit when its transistors switch from 0 to 1 and from 1 to
0. Power analysis is conducted by placing a meter across a resistor between power
and ground of the circuit to recover information regarding the cryptographic operations
being performed [39]. Power analysis is divided into two techniques: Simple Power
Analysis (SPA) and Differential Power Analysis (DPA). Both techniques are based on
measuring the amount of work the system is accomplishing based on the data being
manipulated.

Simple Power Analysis (SPA): SPA involves directly interpreting power signals
collected during a cryptographic circuit’s operations. SPA can yield information about
the operations being conducted by the cryptographic circuit and about the secret key [39]. SPA
is conducted using only a single power trace to extract the secret key based on
conditional logic used in the cryptographic operations performed.

Differential  Power  Analysis  (DPA):  DPA  involves  the  statistical  analysis  of  the 
power  signature  to  yield  the  entire  key  or  partial  information  about  the  key. 
This  process  is  designed  to  recover  enough  information  about  cryptographic 
circuits  to  reconstruct  the  secret  key.  DPA  uses  small-scale  power  effects  based 
on  the  data  being  manipulated  during  cryptographic  operations  [39].  These 
analysis  techniques  use  a  random  plaintext  and  a  constant  key  to  correlate  the 
traces  to  the  data  being  manipulated. 

The first experimental results of power analysis attacks on FPGAs appeared in
[51], where Ors et al. investigated the side channel of a hardware implementation of
Elliptic Curve (EC) point multiplication. They conducted SCA to attack a Montgomery
modular multiplication circuit on the Xilinx Virtex-800 FPGA by measuring
the chip power supply directly. EC point multiplication is implemented using a simple
double-and-add method similar to the square-and-multiply operations conducted in
modular exponentiation (ref. Algorithm 3), but it has a significantly different side
channel signature. Wang et al. proposed a variant of Kim’s countermeasure
[43] that could be resistant to DPA on a hardware implementation of RSA using a
Montgomery modular multiplier [63]. Their design used a form of blinding⁴ within the
modular operations to randomize the power signatures. Their results were all based
on Synopsys® VHDL simulations and were not validated on hardware.

⁴ Blinding involves multiplying operands by a random number.

FPGAs provide an optimal platform to design and test the SCA vulnerabilities
of cryptographic systems prior to deployment. Currently there are several FPGA
platforms that enable a user to conduct SCA attacks on RSA and other cryptographic
algorithms. One such system used by Miyamoto et al., the Side-channel Attack
Standard Evaluation Board (SASEBO) [47] [48] [49], is specifically designed to exploit
the power side channel. These commercially available systems rarely give the user
full access to the source code to conduct trade studies or to create an optimal design
to meet the security and performance needs of the user. In addition, these systems
require the purchase of specialized hardware to operate. A hardware
non-specific design capable of being implemented on any standard FPGA platform
would be well suited to conducting trade studies on power analysis attacks and the
available countermeasures.

2.5.1  Power  Analysis  Attacks. 

2.5.1.1 Simple Power Analysis Attacks.

Kocher introduced SPA by implementing an SPA attack against the Data
Encryption Standard (DES) [39]. Kocher analyzed the timing of branching operations
to correlate power signatures directly to the cryptographic operations being performed,
thereby revealing the secret key [39]. SPA attacks rely on the secret key being used
in conditional logic operations which cause branching conditions based on the bit
values of the key.

2.5.1.2 Differential Power Analysis Attacks.


Correlation Coefficient Attacks: DPA relies heavily on statistical analysis to
conduct attacks. DPA commonly uses the correlation coefficient⁵ to determine
a linear relationship between data. The most common models used with
correlation coefficient attacks are the Hamming-Distance and Hamming-Weight models.

Hamming-Distance and Hamming-Weight Models: The Hamming-Distance model
correlates the power signature of cryptographic operations with the number of
bit transitions within a register or on a data line. The Hamming-Weight
model correlates the power signature with the Hamming Weight⁶ of the
operands.
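
The two models and the correlation step can be sketched in a few lines of Python. The example below is a deliberately idealized toy: the "leakage" is assumed to be exactly the Hamming weight of a 4-bit plaintext XORed with a 4-bit key, with no measurement noise, and all names (`best_key_guess`, `SECRET_KEY`, and so on) are assumptions made for illustration.

```python
from math import sqrt


def hamming_weight(x):
    """Number of 1 bits in x."""
    return bin(x).count("1")


def hamming_distance(a, b):
    """Number of bit positions that differ between a and b."""
    return hamming_weight(a ^ b)


def pearson(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_key_guess(plaintexts, leakage, key_bits=4):
    """Rank every key guess by how well its predicted leakage correlates."""
    scores = {}
    for guess in range(1 << key_bits):
        predicted = [hamming_weight(p ^ guess) for p in plaintexts]
        scores[guess] = pearson(predicted, leakage)
    return max(scores, key=scores.get)


SECRET_KEY = 0b1011                      # hypothetical key, unknown to the attacker
PLAINTEXTS = list(range(16))
LEAKAGE = [hamming_weight(p ^ SECRET_KEY) for p in PLAINTEXTS]  # idealized traces
print(bin(best_key_guess(PLAINTEXTS, LEAKAGE)))
```

With real traces the leakage is noisy, so many more measurements are needed before the correct guess separates from the others.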

2.5.2  Power  Analysis  Countermeasures. 

2.5.2.1 Simple Power Analysis Countermeasures.

SPA  countermeasures  are  implemented  simply  by  avoiding  the  use  of  procedures 
that  use  the  secret  key  in  conditional  logic  within  cryptographic  circuitry. 
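
One common way to remove the key-dependent branch from modular exponentiation is to perform the multiply on every iteration and select the result arithmetically. The Python sketch below illustrates the idea only; it is an assumed example rather than a countermeasure evaluated in this research, and Python integer arithmetic is not constant time, so it shows the control flow and not a hardened implementation.

```python
def square_and_multiply_always(c, d, n, bit_length):
    """Return c^d mod n with the same operation sequence for every exponent bit.

    bit_length is the fixed number of exponent bits to process (assumed to be
    at least d.bit_length()); n is assumed to be greater than 1.
    """
    M, C = 1, c % n
    for i in range(bit_length):
        bit = (d >> i) & 1
        product = (M * C) % n               # multiply is always performed
        M = bit * product + (1 - bit) * M   # arithmetic select, no branch on the key
        C = (C * C) % n                     # square is always performed
    return M


print(square_and_multiply_always(2790, 2753, 3233, 12))
```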

2.5.2.2 Differential Power Analysis Countermeasures.

Countermeasures to DPA have been researched since the inception of DPA. Many
countermeasures to defend against DPA have been presented in the open
literature, yet no countermeasure currently exists that completely protects a
circuit from DPA. There are several classes of DPA countermeasures that are common
throughout the community:

Signal Leakage Reduction: This class of countermeasures involves signal
reduction. This is accomplished by either balancing the Hamming weight of the data
or reducing the dependence of conditional logic functions on the secret key.

⁵ The correlation coefficient is a normalized estimate of the covariance between two random variables.

⁶ The Hamming Weight of a binary number represents the number of 1 bits within the data
sequence.

Noise Induction: This class of countermeasures involves adding noise generators,
such as oscillators, to the circuit to effectively mask the original
circuit’s power consumption. These added signals increase the number of traces required to
conduct an attack. Ultimately, the aim is to increase the required samples
to an unmanageable size.

Timing Randomization: This class of countermeasures involves randomizing circuit
execution timing. This class is similar to a technique implemented by Kocher
[39]. Timing randomization is accomplished by inserting random delays or clock
modifications [44].

Kaps et al. proposed a countermeasure for DPA attacks on FPGA based AES
implementations that uses Dynamic Differential Logic (DDL) to equalize power
consumption and so increase the number of traces (i.e., encryption cycles) needed [36]. This
technique worked, but at a large area penalty.

2.6  Timing  Analysis 

Timing analysis attacks exploit the slight differences in the amount of time
required to perform encryption/decryption operations [38], [40]. These slight differences
in encryption operations can leak information about the secret key being used. By
carefully measuring the time required for a cryptographic system to conduct
private key operations in algorithms like Diffie-Hellman and RSA, attackers are able to
extract the private key [40].




2.6.1  Timing  Analysis  Attacks. 


Diffie-Hellman and RSA are particularly vulnerable in the modular exponentiation
portion of the cryptographic algorithms (see Algorithm 3). In modular exponentiation,
each bit $d_i$ of the exponent (private key) determines whether a square
operation ($d_i = 0$) or square and multiply operations ($d_i = 1$) will be performed.
The goal of an attacker would be to monitor the execution timing of the loop
iterations to extrapolate the private key. Implementations are especially vulnerable when
operations are conducted in a serial manner, such as in software implementations of
cryptographic systems, because they typically execute one operation at a time. These
attacks require a working knowledge of how the system is implemented in order to
conduct a successful attack.
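
The leak can be made concrete with a short Python sketch. It assumes an idealized attacker who can tell a multiply ("M") from a square ("S") in the observed sequence of loop operations, following the operation order of Algorithm 3; the function names are assumptions for this illustration.

```python
def operation_sequence(d):
    """Operations executed by right-to-left square-and-multiply for exponent d."""
    ops = []
    while d:
        if d & 1:
            ops.append("M")   # multiply happens only when the bit is 1
        ops.append("S")       # square happens on every iteration
        d >>= 1
    return "".join(ops)


def recover_exponent(sequence):
    """Rebuild the exponent from an observed sequence of 'M' and 'S' operations."""
    d, bit_index, bit = 0, 0, 0
    for op in sequence:
        if op == "M":
            bit = 1                   # a multiply marks the current bit as 1
        else:
            d |= bit << bit_index     # each square closes one loop iteration
            bit_index += 1
            bit = 0
    return d


observed = operation_sequence(0b1011)
print(observed, bin(recover_exponent(observed)))
```

In this idealized setting the number of squares reveals the exponent length, the number of multiplies reveals its Hamming weight, and their order reveals every bit.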

A  more  practical  method  of  attack  was  detailed  in  [31]  that  does  not  require  as 
much  knowledge  of  the  system  as  those  presented  by  Kocher.  This  attack  used  a 
Montgomery  multiplier  unit  that  used  a  constant  timing  regardless  of  input  factors 
plus  a  reduction  phase  for  the  modulus.  This  allowed  them  to  create  templates  of 
the  different  plaintext  input  combinations  that  did  and  did  not  require  reductions  for 
each  of  the  loops. 

Messerges et al. [46] conducted SPA on an implementation of modular exponentiation
on a smart card. Using this implementation, Messerges was unable to cross-correlate
differences in the square and multiply power signatures, but was able to identify a
slightly different execution time needed for each operation, leading to a combined
power and timing attack. The results of the attack are shown in Figure 7.

Figure 7. Cross-Correlation of Modular Exponentiation on a Smart Card [46]

2.6.2 Timing Attack Countermeasures.

Kocher suggested that an obvious countermeasure to timing attacks is to force all
operations to take the same amount of time [40]. However, forcing software
implementations to run in constant time is difficult in a platform-independent environment
because of compiler options [40]. Forcing a constant operating time can be easier
to implement using hardware, but then the circuit will still leak information on the
power channel based on switching activity.

Kocher [38] proposed that techniques for blinding signatures [26] could be adapted
to Diffie-Hellman and RSA to prevent modular operations from being directly related
to the secret key. Blinding can be used to defend against SCA attacks
and can be applied to the message, the exponent, or both. For RSA, Kocher proposed
that a random pair $(v_i, v_f)$ be applied to the input message around the modular
exponentiation operation. First choose a random $v_f$ relatively prime to $n$ and
compute $v_i = (v_f^{-1})^e \bmod n$. The input message is then multiplied by $v_i \bmod n$
before the exponentiation, and the result is recovered afterward by multiplying by $v_f \bmod n$.
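
A minimal Python sketch of this message blinding is given below. It relies on $ed \equiv 1 \pmod{\phi(n)}$, so that $(m \cdot v_i)^d \cdot v_f \equiv m^d \cdot v_f^{-1} \cdot v_f \equiv m^d \pmod n$. The function name, the optional `v_f` argument, and the toy key values are assumptions for illustration; the modular inverse via `pow(v_f, -1, n)` requires Python 3.8 or later.

```python
from math import gcd
import secrets


def blinded_private_op(message, d, e, n, v_f=None):
    """Compute message^d mod n while exponentiating only a blinded value."""
    if v_f is None:
        while True:                               # pick a random v_f coprime to n
            v_f = secrets.randbelow(n - 2) + 2    # value in the range 2 .. n - 1
            if gcd(v_f, n) == 1:
                break
    v_i = pow(pow(v_f, -1, n), e, n)              # v_i = (v_f^-1)^e mod n
    blinded = (message * v_i) % n                 # blind the input
    result = pow(blinded, d, n)                   # private-key exponentiation
    return (result * v_f) % n                     # unblind the output


# Toy key values (assumed for illustration): n = 61 * 53, e = 17, d = 2753.
print(blinded_private_op(2790, 2753, 17, 3233))
```

Because a fresh $v_f$ is drawn on every call, the value actually exponentiated differs each time even when the same input is processed repeatedly.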

2.7  Electromagnetic  Analysis 

Electromagnetic Analysis (EMA) attacks are similar to power analysis attacks except that
they do not require physical tampering with the device under attack. EMA exploits the
TEMPEST (i.e., EM radiation) information leaked by electronic devices, commonly
called compromising emanations [55]. By using technology like inductive probes, the
adversary uses passive analysis to collect circuit power information. The EM side
channel was proposed not long after power analysis and was later demonstrated on
smart cards by Quisquater and Samyde [55]. Just like power analysis, there are
two techniques for EMA introduced by Quisquater: Simple Electromagnetic Analysis
(SEMA) and Differential Electromagnetic Analysis (DEMA).

Simple  Electromagnetic  Analysis  (SEMA):  This  type  of  analysis  involves  the 
same  processes  as  SPA  except  that  it  uses  a  different  side  channel  to  execute.  A 
single  trace  is  collected  on  a  cryptographic  operation  to  extract  the  secret  key. 

Differential  Electromagnetic  Analysis  (DEMA):  This  type  of  analysis  involves 
the  same  processes  as  DPA.  Many  traces  are  collected  using  random  plaintext 
input  and  a  constant  key.  Traces  are  then  correlated  and  analyzed  to  extract 
the  secret  key. 

EMA was first conducted on smart cards by Quisquater using a small coil of wire
connected to an oscilloscope to measure the current [55]. Carlier et al. demonstrated
that the secret key could be revealed by using DEMA on an FPGA implementation
of AES. Carlier also found that the EM side channel emanations of an FPGA are
of the same nature as those from a smart card [23]. They attacked the hardware
implementation of AES successfully, showing that the EM side channels of the FPGA
can be exploited by an attacker to retrieve information about the secret key. Mulder
et al. conducted SEMA and DEMA attacks on an FPGA implementation of an Elliptic
Curve Cryptosystem [50] executing EC point multiplication.

2.7.1  EMA  Attacks. 

Attacks on cryptographic circuits using EMA are conducted in the
same manner as power analysis attacks, regardless of whether SEMA or
DEMA is being conducted. The only difference is the method of collection.

2.7.2 EMA Countermeasures.


The  first  commonplace  countermeasure  for  EMA  is  the  use  of  additional  metal 
layers  to  reduce  the  electromagnetic  fields  on  the  chip  [55].  Another  countermeasure 
to  EMA  is  to  design  the  circuit  for  lower  power  consumption.  Noise  generators,  such 
as  oscillators,  can  be  easily  added  to  the  circuitry  to  generate  additional  signals  in  an 
attempt  to  make  EMA  more  difficult. 

2.8  Literature  Review  Summary 

Reconfigurable  hardware  simplifies  and  reduces  the  cost  of  the  design  process  by 
allowing  the  customer  to  rapidly  iterate  through  prototype  designs.  FPGAs  have 
been  found  to  be  vulnerable  to  the  same  methods  of  attack  as  ASIC  circuitry,  but 
one  benefit  FPGAs  have  is  the  ability  to  test  possible  design  iterations  before  locking 
in a final design. The AES and RSA cryptographic algorithms were presented as a
basis for cryptographic circuits. Finally, several common side channel analysis attacks
were presented along with possible countermeasures designed to increase the required
traces to an unmanageable size.

III. Methodology of the Side Channel Analysis “CLOAK” Countermeasure


This chapter describes the methodology used to research and create a flexible
dynamic protection for encryption circuits against Side Channel Analysis (SCA) attacks
by creating the Critically Low Observable Anti-Tamper Keeping Circuit (CLOAK)
countermeasure.

3.1  Problem  Definition 

Current public key encryption systems do not have adequate protections against
side channel analysis attacks. Cryptographic circuits are particularly susceptible to
EM and power SCA attacks. The RSA [56] public key cryptographic algorithm is
especially vulnerable to SCA attacks on the timing of the multiplication and squaring operations
used during modular exponentiation in the private-key operations. Timing attack
vulnerabilities affect many current public key encryption algorithms [38]. While static
defensive countermeasures have been incorporated in more recent designs, they do
not address all the attack techniques known to be used.

The key vulnerability an SCA attack exploits in cryptographic algorithms like RSA
is the predictable time required to calculate the private key operation using modular
exponentiation. Timing attacks can be conducted by physically connecting a meter
between power and ground on the chip or by non-physical means that measure the EM
radiation leaked by the circuit.
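
The leak can be made concrete with a short sketch. The Python below is an illustration only, not the hardware design studied here: it implements the textbook left-to-right square-and-multiply method and records each operation as it executes. The function names `square_and_multiply` and `recover_exponent` are hypothetical. Because a multiply follows a square only when the current exponent bit is 1, an observer who can tell the two operations apart can read the exponent directly from the sequence.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary modular exponentiation.

    Returns (result, ops), where ops lists the operations in execution
    order: 'S' for a modular square and 'M' for a modular multiply.
    """
    result = 1
    ops = []
    for bit in bin(exponent)[2:]:          # most significant bit first
        result = (result * result) % modulus
        ops.append("S")
        if bit == "1":                     # key-dependent branch: the leak
            result = (result * base) % modulus
            ops.append("M")
    return result, ops


def recover_exponent(ops):
    """Rebuild the exponent from an observed sequence of 'S' and 'M'."""
    bits = []
    for i, op in enumerate(ops):
        if op == "S":
            followed_by_multiply = i + 1 < len(ops) and ops[i + 1] == "M"
            bits.append("1" if followed_by_multiply else "0")
    return int("".join(bits), 2)
```

For example, the exponent 0b101101 produces the sequence S M S S M S M S S M, which `recover_exponent` maps back to 0b101101 without any knowledge of the base or modulus.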

3.1.1  Goals  and  Hypothesis. 

The goal of this research is to determine whether a polymorphic circuit can protect
a device from an EM timing analysis attack. The hypothesis of this research is that
the EM signature of a cryptographic algorithm can be varied in such a way that the
observer/adversary cannot correlate the side channel signature to the cryptographic
functions being executed. The approach to accomplish this is discussed in the following
sections.

3.1.2  Research  Approach. 

The desired outcome of this research is the creation and analysis of a Critically Low
Observable Anti-Tamper Keeping Circuit (CLOAK) countermeasure that provides
flexible dynamic protection for encryption circuits against SCA attacks. This research
is accomplished in three parts: the polymorphic circuit design, the implementation of RSA
with reconfiguration, and the analysis of the side channel signatures, all with the ultimate
goal of protecting the secret key.

3.1.2.1 Polymorphic Circuit Design.

The key to providing a secure cryptographic system is the protection of the secret
key from the adversary. This research leverages and extends previous work by Porter
[54], [53] and Cady [22] by using polymorphic circuits. Polymorphic reconfiguration
is a combination of polymorphic gates or decision circuits that creates a dynamically
changing circuit design capable of changing the way it functions based on an
input key. A polymorphic circuit design is developed and implemented on an FPGA.
This polymorphic circuit design is capable of continuously changing the way it
functions to obfuscate its side channel signature in both power consumption and timing.
This effectively creates the CLOAK countermeasure for hardware obfuscation. The
circuit is then implemented in the modular exponentiation circuitry of a hardware
implementation of RSA, and its EM side channel is characterized to ensure proper
functionality. The entire polymorphic circuit development process is shown as block
diagrams in Figure 8.

(d) Randomized Power RSA Trace
(e) Randomized Power and Timing (CLOAKed) RSA Trace

Figure  8.  Development  of  the  Polymorphic  Circuit  Design  using  Block  Diagrams 


The  polymorphic  circuit  development  is  broken  down  into  three  primary  steps: 

1. Flatten Power Signature: The original EM trace for RSA is shown in Figure
8(a), where the square and multiply operations, shown with “S” and “M” blocks,
are executed at the same time. Flattening the power involved two iterations.
The first iteration, shown in Figure 8(b), separated the square and multiply
operations in time and became the baseline version of RSA. The second iteration,
shown in Figure 8(c), reduced the two modular multiplier instantiations to a single
instantiation used for both multiply and squaring operations, which became the
level power version of RSA.

2. Randomize Power Signature: Once the circuit power consumption was
relatively level, the power consumption within each multiplier operation was
randomized by iterating through multiple adders within each modular multiplier
instantiation. The random power version of RSA is shown in Figure 8(d).

3. Randomize Circuit Timing: The third and final step in the polymorphic
circuit development randomized the circuit power consumption and timing in
order to create the final CLOAKed RSA design, shown in Figure 8(e).

3.1.2.2 Implement RSA with Reconfiguration.

An RSA implementation is also designed and implemented on an FPGA using
a structural approach, which is a design based on the components and their
interconnections. This design approach gives the developer more control over the
implementation of the modular exponentiation portion of the cryptographic algorithm. The
first  implementation  of  RSA  uses  unobfuscated  modular  exponentiation  circuitry  for 
baseline  analysis.  The  second  implementation  uses  the  polymorphic  circuit  design  for 
the  modular  exponentiation  circuitry. 

3.1.2.3 Analyze Side Channel Signatures.

The EM side channel of the original RSA implementation is characterized to obtain
a baseline for subsequent analysis of the side channel. The baseline is compared to the
modified RSA implementation to characterize the extent to which the circuit is able to
resist timing attacks and protect the secret key. It is expected that the combination of
elements in this approach culminates in a proof-of-concept polymorphic circuit design
that enhances system protective countermeasures by obfuscating operations.


3.2  System  Boundaries 


The System Under Test (SUT) is the Polymorphic CLOAK Encryption System
(CES) as shown in Figure 9. The Polymorphic CES consists of four primary
components: the RSA encryption algorithm, an FPGA, a processor, and the CLOAK
countermeasure.

Figure 9. Polymorphic CLOAK Encryption System


The Component Under Test (CUT) is the CLOAK countermeasure integrated into
a hardware implementation of the RSA cryptographic algorithm on a Xilinx Virtex-5
FPGA. The system also contains a PowerPC® 440 Processor Core to handle all the
data I/O interfacing for the system. These components are discussed in more detail
below.


3.2.1  RSA  Encryption  Algorithm. 

The RSA encryption algorithm is a public-key cryptographic algorithm that uses
two keys and a modulus to encrypt and decrypt messages. This algorithm is
implemented using a structural hardware implementation programmed on the FPGA.

3.2.2  Xilinx  Virtex-5  and  Virtex-6  FPGAs. 

The Xilinx® Virtex-5 and Virtex-6 FPGAs consist of an array of gates that can be
reprogrammed by the user. The configuration process starts by defining the hardware
circuit using the Very-High-Speed Integrated Circuit (VHSIC) Hardware Description
Language (VHDL) [2]. The proprietary Xilinx® software transforms the user's VHDL
circuit description into a bitstream to be downloaded and programmed on the chip.

3.2.3  Processor  Core. 

The PowerPC 440 processor core is a 32-bit processor [14] that serves as the central
control unit for the system implemented on the Virtex-5. The MicroBlaze soft-core¹
processor is also a 32-bit processor and serves as the central control unit for the
system implemented on the Virtex-6. Each processor core handles all the data I/O
interfacing for the SUT on its respective FPGA.

3.2.4  CLOAK  Countermeasure. 

The CLOAK countermeasure is a polymorphic circuit design implemented in the
modular exponentiation component within the RSA encryption algorithm. This
polymorphic circuit design continuously adjusts the way in which the circuit performs its
operations to obfuscate the EM side channel from the adversary.

¹ A soft-core processor is implemented using VHDL on an FPGA.

3.3  System  Services 

The  Polymorphic  CLOAK  Encryption  System  provides  two  services. 

RSA  Block  Encryption  Service:  The  block  encryption  service  either  encrypts 
the  message  or  not. 

Obfuscation of (CLOAK) EM Side Channel: The obfuscation service either
works or not, and any amount of partial obfuscation is considered a successful
obfuscation. The amount of partial obfuscation is indicated by the increase in the
number of traces required to successfully retrieve the key.

3.4  Workload  Parameters 

The  workload  parameters  of  the  system  characterize  the  requests  for  service  to 
the  system.  These  parameters  describe  the  quantity  and  characteristics  of  the  data 
the  system  operates  on. 

3.4.1  Message  Offered  Load. 

Since the system encrypts messages to ciphertext, the number of messages offered
to the system drives the system workload. The workload and the output ciphertext of
the system are directly proportional to the number of messages offered.
This trend holds until the capacity of the system is reached. For the experiment, the
number of messages offered to the system is limited to the number of traces required
in a collection. Initially this value is set to 1,000 since that is sufficient for the SCA
software.




3.4.2  Random  versus  Static  Message. 


If the system is using a static message, the system only has to receive and store
the message in memory once per data collection. If the system is using a random
message each encryption cycle, a new message must be transferred and stored every
cycle, so the achievable workload of the system is reduced by the amount of time it
takes to transfer and store each message to memory. For the experiment, a
random message is used each encryption cycle.

3.4.3  Encryption  Key  Length. 

The  encryption  key  length  changes  the  number  of  operations  required  to  complete 
an  encryption  cycle.  A  longer  key  requires  more  modular  exponentiation  operations 
to  produce  the  resulting  ciphertext.  An  increase  in  this  workload  parameter  would 
decrease  the  amount  of  data  the  system  can  operate  on  by  increasing  the  encryption 
circuit execution time. For this experiment, the encryption key length is set to 512
bits. A key length of 512 bits gives the system sufficient workload and security, while
still being able to fit on the hardware.

3.4.4  Polymorphic  Key  Length. 

The polymorphic key length determines the circuit change frequency for the
polymorphic circuit function. Ultimately the circuit should pseudo-randomly change its
function each time it is used. That is to say that the encryption circuit should not
only vary function timing between the polymorphic functions used, but each
iteration of the encryption algorithm should also be seeded randomly. This randomness in
the execution time changes the amount of data the system may operate on. For this
experiment, the polymorphic key length is set to 64 bits.

3.5 Performance Metrics


The SUT is evaluated based on FPGA area overhead for polymorphic logic,
modular exponentiation timing delay variation, encryption circuit timing delay/latency,
and number of traces.

3.5.1  FPGA  Area  Overhead  for  Polymorphic  Logic. 

In  every  system,  area  is  a  limited  commodity  and  should  not  be  wasted.  On  a 
typical  ASIC,  an  increase  in  circuit  area  equates  to  an  increase  in  chip  area,  and 
therefore,  increased  cost.  On  an  FPGA,  the  available  programmable  space  is  defined 
by  the  manufacturer.  To  maximize  the  functionality  of  a  system  on  a  given  FPGA, 
care should be taken not to use all the available space for a given circuit. The
cryptographic circuitry is implemented with and without the CLOAK countermeasure
to determine the increased area usage for the polymorphic logic. System utilization
metrics are reported by the proprietary Xilinx software. This metric highlights the
feasibility of implementing Polymorphic CES in embedded or space-limited
applications.

3.5.2  Modular  Exponentiation  Timing  Delay  Variation. 

The key to obfuscating the modular exponentiation function of RSA is to vary
the timing of the square and multiply functions in such a way that an observer
cannot determine which function is being executed. The randomness and frequency
of the timing variations within modular exponentiation functions are measured from
the time the data is available to when the result is produced for each operation. These
results are compared to timing characteristics of normal circuit operations. In stages
of modular exponentiation where the circuit executes only a square operation or a
square and multiply operation, an observer gains insight into determining the key.
The timing delays for both square and multiply are varied in such a way that an
observer cannot determine which one is being executed. This metric highlights the
ability of the system to obfuscate its EM side channel signature to reduce the amount
of information the adversary can determine about the system.
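
The leak and the intended effect of timing variation can be illustrated with a toy model. This sketch is not the CLOAK circuit: the cycle counts (`BASE_CYCLES`, `MAX_JITTER`), the uniform jitter, and the function names are assumptions chosen only to show that, once the duration of each operation varies, the duration of a stage no longer maps one-to-one onto a key bit.

```python
import random

BASE_CYCLES = 100   # assumed fixed cost of one modular square or multiply
MAX_JITTER = 150    # assumed upper bound on extra cycles added per operation


def stage_durations(key_bits, rng=None):
    """Return the duration of each exponentiation stage, one per key bit.

    A stage is one square ('0' bit) or a square plus a multiply ('1' bit).
    With rng=None every operation takes BASE_CYCLES; otherwise each
    operation is stretched by a random number of extra cycles.
    """
    durations = []
    for bit in key_bits:
        n_ops = 2 if bit == "1" else 1
        total = 0
        for _ in range(n_ops):
            jitter = rng.randint(0, MAX_JITTER) if rng is not None else 0
            total += BASE_CYCLES + jitter
        durations.append(total)
    return durations


def guess_bits(durations, threshold=1.5 * BASE_CYCLES):
    """Naive timing attack: a long stage is guessed to be a '1' bit."""
    return "".join("1" if d > threshold else "0" for d in durations)
```

With fixed timing, `guess_bits(stage_durations(key))` returns the key exactly. With jitter enabled, a '0' stage can last anywhere from 100 to 250 cycles and a '1' stage from 200 to 500 cycles under these assumed values, so the two ranges overlap and a single threshold can no longer separate them reliably.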

3.5.3 Encryption Circuit Timing Delay/Latency.

For any encryption system the major bottleneck is the speed of encryptions.
Therefore, circuit delay/latency is measured. The delay is defined to be the time from
the moment the key and message are available to the encryption circuit to the moment
the ciphertext becomes available. The critical path timing delay of the encryption circuit
is measured using two methods. First, the maximum delay for the encryption
circuitry is estimated and reported by the Xilinx software upon implementation on the
chip. Second, the EM side channel is monitored using an oscilloscope triggered by a
signal indicating the beginning and end of the encryption cycle. This metric reveals
the time required for an encryption cycle.

3.5.4  Number  of  Traces. 

SCA software, whether it be Inspector® or Matlab®, requires traces to characterize
the system's side channel signature. A trace is the EM or power waveform collected
by the oscilloscope and passed to the SCA software for collection and analysis. Once
this side channel signature is determined, the key can then be extracted. The amount
of information an observer can extract about circuit operations increases with the
number and quality of traces collected. The number of traces required to determine
the key is an indication of the level of circuit obfuscation.

3.6 System Parameters


The system parameters are the characteristics of the system that, if changed, affect
the metrics or the response of the system.

3.6.1  Background  Noise. 

The emanations given off by electronic equipment within the range of the SUT
(hereafter referred to as “Background Noise”) can affect the EM readings. Some
background noise, such as that from power lines or clock circuits, occurs at regular
frequencies and can be identified and filtered out of the resulting traces. Background
noise sources are identified and limited to reduce their effect on the SUT.

3.6.2  Implementation  of  RSA. 

The RSA public key algorithm is implemented in hardware on an FPGA using
structural rather than behavioral VHDL. A structural approach represents the system
design in terms of its components and interconnections, which allows more control
over implementation and how the system will respond. In addition, it enables the
polymorphic circuit design to be integrated into the overall implementation of RSA.
The characteristics of the EM side channel differ greatly depending upon the
implementation method used.

3.6.3  Polymorphic  Frequency. 

The frequency of the circuit changes is controlled by the length of the polymorphic
key. The polymorphic frequency changes the timing of the modular exponentiation
function in a pseudo-random manner. In turn, this will change the system response
and effectively CLOAK the function from SCA.

3.6.4 Xilinx FPGAs.


The Xilinx Virtex-5 FPGA is used to implement the hardware portion of the
RSA algorithm. The FPGA allows rapid prototyping and testing of the system.
Each version and family of FPGAs has slightly different EM side channel signatures
and timing characteristics. The Virtex-5 chip package uses a flip-chip design, as
shown in Figure 10, which can also affect the EM side channel signature. These differences
may change the system speed, power consumption, or area, which may also affect the
system response to inputs.

Figure 10. Cross Section for Virtex-5 Flip-Chip Package [11]


The Xilinx Virtex-6 FPGA is also used after initial development to characterize
the differences in the side channel between the FPGA families. The Virtex-6 chip package
uses a flip-chip design similar to that of the Virtex-5. However, the Virtex-6 is
built on a smaller 40 nm process technology, which enables decreased power consumption.

3.6.5  Processor  Core. 

The PowerPC 440 processor core is embedded in the Virtex-5 FPGA and provides
data I/O and controls how the system will operate and respond to inputs. The
processor core also contributes to the background EM noise of the system. The
Virtex-6 FPGA does not contain an integrated PowerPC processor core, so a MicroBlaze
soft-core processor core will be used.


3.7  Factors 

Table 1 contains a summary of the factors and levels used in this methodology.
Each of these factors is described in detail in the following sections.

Table 1. Factors and Levels

Factors                      Levels
Polymorphic Frequency        None/Full
FPGA Version for RSA         Virtex-5/Virtex-6
Random vs. Static Message    Random/Static

3.7.1  Polymorphic  Frequency. 

The ability of the system to adjust its function timing establishes the system’s
ability to obfuscate the circuit function from the adversary. This research uses two levels
of polymorphism: none and full polymorphism, where full polymorphism is changing
the circuit function for each step of the modular exponentiation algorithm. The
existence of timing variations and the ability to obfuscate the circuit are dependent on
each other and are said to interact with each other.
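
This section does not specify how the pseudo-random selection is generated. As a hedged illustration of how a 64-bit polymorphic key could drive a new choice at every step, the sketch below models a 64-bit Fibonacci linear feedback shift register (LFSR), a generator commonly used in hardware. The LFSR itself, the tap positions (64, 63, 61, 60, a tap set found in published LFSR tables), the adder variant names, and the function names are all assumptions for illustration, not the CLOAK implementation.

```python
MASK64 = (1 << 64) - 1
ADDER_VARIANTS = ("adder_a", "adder_b", "adder_c", "adder_d")  # hypothetical names


def lfsr64_step(state):
    """Advance a 64-bit Fibonacci LFSR by one bit (taps 64, 63, 61, 60)."""
    feedback = (state ^ (state >> 1) ^ (state >> 3) ^ (state >> 4)) & 1
    return ((state >> 1) | (feedback << 63)) & MASK64


def variant_schedule(polymorphic_key, n_operations):
    """Pick one adder variant per operation from a 64-bit seed."""
    state = polymorphic_key & MASK64
    if state == 0:
        raise ValueError("an all-zero seed locks up the LFSR")
    schedule = []
    for _ in range(n_operations):
        # Clock eight times so each choice is drawn from eight fresh bits.
        for _clock in range(8):
            state = lfsr64_step(state)
        schedule.append(ADDER_VARIANTS[(state >> 56) % len(ADDER_VARIANTS)])
    return schedule
```

The same seed always reproduces the same schedule, which is what allows a pseudo-random design to remain functionally verifiable, while a different seed for each encryption changes the order in which the variants are exercised.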

3.7.2  FPGA  Version  for  RSA. 

RSA can be implemented on any FPGA with suitable resources available to fit
the design. Each hardware version will create a different EM side channel signature.
The FPGA version is varied to determine its impact on a successful attack. The
hardware implementations of RSA use VHDL to define their structure. The FPGA is
varied between Virtex-5 and Virtex-6 designs.

3.7.3 Random versus Static Message.


The message is varied between static (fixed message) and random (varied message)
to determine its impact on an attack. A static versus random message may change
the side channel signature by changing the loading on the encryption circuit. Varying
message content increases the difficulty of characterizing the side channel signature
and therefore of attacking the system.

3.8  Evaluation  Technique 

The evaluation technique used is measurement on real hardware, since a simulation
can only approximate actual values. A real system can more accurately evaluate
the functions of a circuit. Figure 11 shows the experimental configuration. The system
is developed and tested on a Xilinx Virtex-5 board. Measurements (traces) are taken
using a LeCroy WavePro 725Zi oscilloscope with a Willtek® 1207 Inductive EM probe,
triggered by a pulse signal from the SUT. EM traces from the Polymorphic CES
are compiled and analyzed using Riscure’s Inspector SCA Test Software version 4.1.1.
The SUT receives the message, key, and control signals from the Inspector software
and returns ciphertext to it through the serial port.

[Figure 11 labels: Xilinx ML507; Riscure Inspector Evaluation Platform; O-Scope w/ EM Probe; Serial]


Figure  11.  Experimental  System  Configuration 

Section 3.1.2 discusses the baseline traces collected with the Inspector software
throughout the design process. The traces collected from the obfuscated circuit design
are validated against the baseline traces to ensure proper functionality. Inspector’s
cryptographic analysis software is also used to analyze the circuit’s ability to obfuscate
its function by comparing the number of traces needed against the baseline tests.

The effectiveness of polymorphic reconfiguration in the modular exponentiation
design is evaluated in four parts. First, the side channel signatures of the individual
polymorphic subcircuits are verified. Second, the effect of the polymorphic modular
exponentiation operations on the side channel signature is compared to the baseline
signatures. Third, the side channel vulnerability signature of the baseline RSA circuit
is characterized. Finally, a practical case study implements the RSA public key
cryptography algorithm using the polymorphic modular exponentiation component.
The modular exponentiation component of RSA is chosen as
the implementation test bench due to its vulnerability to timing attacks using the
EM side channel. Evaluation is based on how many traces are required to recover the
secret key and how the results compare to the baseline circuit.

3.9  Experimental  Design 

A full factorial design is conducted. There are 3 factors with 2 levels each (i.e.,
2^k, where k = 3), resulting in 8 experiments. Two replications of each experiment are
conducted for a total of 16 experiments. A 95% confidence level is used for subsequent
analysis  of  the  data.  This  is  sufficient  to  show  that  the  Polymorphic  CES  can  protect 
a  cryptographic  circuit  from  EMA  attacks. 
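
The run list implied by this design can be enumerated directly. The sketch below is illustrative: the factor names and levels are taken from Table 1, while the dictionary keys and the function name `build_runs` are hypothetical.

```python
from itertools import product

# Factors and levels from Table 1.
FACTORS = {
    "polymorphic_frequency": ("None", "Full"),
    "fpga_version": ("Virtex-5", "Virtex-6"),
    "message": ("Random", "Static"),
}
REPLICATIONS = 2


def build_runs(factors=FACTORS, replications=REPLICATIONS):
    """Enumerate every factor-level combination, repeated per replication."""
    names = list(factors)
    runs = []
    for levels in product(*(factors[name] for name in names)):
        for rep in range(1, replications + 1):
            run = dict(zip(names, levels))
            run["replication"] = rep
            runs.append(run)
    return runs
```

Calling `build_runs()` yields 2^3 = 8 distinct factor combinations, each listed twice, for the 16 experiments stated above.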

3.10  Methodology  Summary 

This  chapter  defines  the  experimental  methodology  for  the  Polymorphic  CES. 
The goal of this research is to determine the effectiveness of the polymorphic CLOAK
circuit as a countermeasure to EM SCA attacks on RSA. This research focuses on the
timing  vulnerabilities  in  the  modular  exponentiation  portions  of  the  RSA  algorithm 
to  secure  the  secret  key  from  possible  attack.  The  SUT  is  defined  and  bounded  by 
clearly  defining  its  components.  The  system  services  are  listed  with  possible  outcomes. 
Performance  metrics  are  defined  based  on  the  system  services.  System  and  workload 
parameters are identified and associated with their sensitivity to the SUT. A list of
factors and their levels is chosen from the selected parameters. The experimental
design  measures  the  EM  side  channel  signature  using  an  oscilloscope  connected  to 
the  Inspector  SCA  software  to  collect  and  analyze  traces.  These  traces  are  validated 
against  baseline  measurements  to  evaluate  the  effectiveness  of  the  SUT.  The  factors 
list  is  compiled  into  a  full  factorial  design  of  a  real  system  resulting  in  16  experiments. 




IV.  Results 


This  chapter  first  discusses  the  experimental  setup  and  the  methodology  used  for 
attacking the system in Sections 4.1 and 4.2. The chapter continues in Sections 4.3
and 4.4 by progressively describing the development of the RSA Encryption System
and  polymorphic  circuit  within  the  design.  Results  are  given  throughout  the  process 
of  development  and  final  results  for  the  Polymorphic  CES  are  given  in  Section  4.5. 
Section  4.6  expands  the  research  by  adapting  the  Polymorphic  CES  design  to  run 
on  the  Virtex-6  board  for  comparison.  Finally,  all  the  designs  are  compared  to  each 
other  based  on  running  time  and  size  metrics  in  Section  4.7.  These  designs  are  also 
compared  to  comparable  industry  implementations  currently  available. 


4.1  Experimental  Setup 


Figure  12.  Experimental  System  Setup 


Figure 12 shows the hardware realization of the experimental system
configuration shown in Figure 11. On the left is the LeCroy
oscilloscope, in the middle is the Inspector software used for side channel analysis,
and on the right is the Electromagnetic (EM) probe measuring the unintended EM
emissions from the SUT.

Figure 13 shows a close-up view of the Virtex-5 FPGA with an inductive probe placed
on top of it and all the connections labeled. Data was collected not only
from the top of the FPGA but also from the bottom of the
board.


[Figure 13 labels: JTAG Programmer Port; O-Scope Trigger Port]


Figure  13.  RSA  Encryption  System  Connections 


Figure 14 shows the typical Inspector console output for one cycle of RSA
encryption with a fixed key, modulus, and plaintext. The first line contains the “set key”
command (40h), the data length (00 40h = 64 bytes), and 64 bytes of data holding the
512-bit key value. The second line is the echo from the RSA Encryption System. Line
three contains the “set modulus” command (42h), the data length (00 40h = 64 bytes),
and 64 bytes of data holding the 512-bit modulus value. The fourth line is the echo from
the RSA Encryption System. Line five contains the “hardware RSA encryption” command
(45h), the data length (00 40h = 64 bytes), and 64 bytes of data holding the 512-bit
plaintext value. The sixth line is the ciphertext response from the RSA Encryption
System. The seventh line is an echo from Inspector to the console indicating the
ciphertext received from the FPGA via the serial port. Inspector has the capability to
randomize the plaintext or key value before sending it to the RSA Encryption System.
Retransmission of the key and modulus is not required if every execution cycle uses
the same key and modulus.

Red - Command (one byte)
Blue - Data Length (two bytes)
Yellow - Data to RSA Encryption System (64 bytes)
Green - Data from RSA Encryption System (64 bytes)
Grey - Console Output
Angle Bracket - Data Flow Direction

Figure 14. Inspector console output for a round of RSA Encryption using sample key
and modulus from [37]
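
The command framing described for Figure 14 (one command byte, a two-byte data length, then the data) can be sketched as follows. This is an illustration under stated assumptions, not the testbed's host software: the helper names `build_frame` and `parse_frame` are hypothetical, the length field is assumed to be big-endian because 00 40h is read as 64, and echo handling and serial I/O are omitted.

```python
import struct

# Command bytes as given in the description of Figure 14.
CMD_SET_KEY = 0x40
CMD_SET_MODULUS = 0x42
CMD_HW_RSA_ENCRYPT = 0x45
BLOCK_BYTES = 64  # 512-bit key, modulus, or plaintext


def build_frame(command, payload):
    """Pack one command frame: command byte, 16-bit length, payload."""
    if len(payload) != BLOCK_BYTES:
        raise ValueError("payload must be exactly 64 bytes (512 bits)")
    # ">BH" = big-endian, unsigned byte followed by unsigned 16-bit integer.
    return struct.pack(">BH", command, len(payload)) + payload


def parse_frame(frame):
    """Split a frame back into (command, payload), checking the length."""
    command, length = struct.unpack(">BH", frame[:3])
    payload = frame[3:]
    if len(payload) != length:
        raise ValueError("length field does not match payload size")
    return command, payload
```

A "set key" frame built this way begins with the bytes 40 00 40 and is 67 bytes long, matching the command and length fields described for the first console line.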

4.2  Attack  Methodology  (Top  vs.  Bottom  and  Probe  Type) 

Initially, a small key and plaintext were used to ensure the circuitry worked
properly. It was soon discovered that using a characteristic key and plaintext size (i.e.,
512 bits or more) creates considerably larger EM power levels since the circuitry
conducts more switching. The EM flux signal of a hardware circuit implementation is
directly proportional to the amplitude of the input voltage, which is directly
proportional to the switching activity of the circuit [35]. When using a small key the circuit
has less switching activity and is easier to attack [45] due to the reduced set of possible
key guesses. For this research a realistic full-length 512-bit key is used.

During initial attacks on the FPGA it was discovered that the bottom of the
Virtex-5 FPGA contains an array of internal power coupling capacitors (VCCINT),
shown in Figure 15. These capacitors not only provide primary internal power to
the FPGA, but also tend to leak EM information. To compound this issue, the System
Monitor coupling capacitor is found in the direct center of this array. The System
Monitor function is capable of measuring physical operating parameters like on-chip
power supply voltages. The EM information leaked by these capacitors relates directly
to the operations being performed by the FPGA and provides an optimal avenue for
passive EMA of the circuit’s power consumption.


Figure  15.  Capacitors  on  the  bottom  of  the  Virtex-5  FPGA 

The Virtex-5 FPGA uses a flip-chip design, shown in Figure 10, in
order to reduce the EM emissions and effects from noise [11]. In an attempt to increase
the EM signal quality collected from the top of the FPGA, the copper heatspreader
was removed from the top of the chip. Figure 16 shows the Virtex-5 chip before and
after the copper heatspreader was removed. In a typical passive attack the chip would
not likely be depackaged, but doing so was necessary to make an informed decision on
how to attack the circuit. The removal of the heatspreader did not provide a very
noticeable change in signal strength for the Willtek probe. The reduced signal strength
on top is partially due to the chip design.

(a) Chip Capped (b) Chip Decapped


Figure  16.  Depackaging  the  Virtex-5  Chip 


As shown in Figure 17, the signal strength differs considerably between the top and
bottom of the FPGA. The signal from the bottom is much better defined, allowing
SEMA to be easily conducted. Due to its lower sensitivity and reduced gain, the
Willtek probe was better suited for attacking the larger signals on the bottom of
the FPGA. The Riscure® probe is more sensitive than the Willtek probe and becomes
saturated if used to collect data from the bottom of the board. The Riscure probe is
better suited to smaller and more localized signals. When used to attack the circuit
from the top it gets a much better signal than the Willtek probe. Figure 18 shows
the difference in EM signal power levels between the two probes. The Willtek probe
from the bottom gets almost double the power amplitude of the Riscure probe from
the top.

(b) EM signal taken from bottom of board

Figure 17. Comparison of EM signal strength using Willtek probe

(a) EM signal taken with the Riscure probe from top of board
(b) EM signal taken with the Willtek probe from bottom of board

Figure  18.  Comparison  of  EM  signal  strength  between  probes 


The  majority  of  the  data  was  collected  by  the  Willtek  probe  from  the  bottom  of 
the  FPGA,  which  is  shown  in  Figure  19. 


Figure 19. EM Inductive Probe on Virtex-5 FPGA

4.3 RSA Encryption Development


The RSA Encryption System was developed using behavioral VHDL and implemented
on a Virtex-5 FPGA. The 512-Bit RSA algorithm was designed using the
Right-to-Left Square-and-Multiply Algorithm (see Algorithm 3) to implement modular
exponentiation. The system components for my RSA Encryption System are
shown in Figure 20 and are briefly described below. Figure 50 in Appendix A shows
a flowchart of how my RSA Encryption System works.
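
As an illustration only, the right-to-left square-and-multiply loop can be sketched in Python. This is a software model and not the VHDL of the design; the function name modexp_rtl, the per-bit operation log, and the toy modulus 3233 are assumptions introduced here, while the exponent 0xEB5 echoes the key bits labeled in the trace figures.

```python
def modexp_rtl(base, exponent, modulus):
    """Right-to-left square-and-multiply (software sketch).

    Returns (result, ops) where ops holds one entry per exponent bit,
    least significant bit first: "S" for a lone square (bit = 0) and
    "SM" for a square accompanied by a multiply (bit = 1).
    """
    result = 1
    square = base % modulus
    ops = []
    while exponent > 0:
        next_square = (square * square) % modulus   # square: always executed
        if exponent & 1:
            result = (result * square) % modulus    # multiply: only for 1-bits
            ops.append("SM")
        else:
            ops.append("S")
        square = next_square
        exponent >>= 1
    return result, ops


if __name__ == "__main__":
    # Hypothetical toy parameters: 3233 = 61 * 53 is far too small for real use.
    ciphertext, ops = modexp_rtl(42, 0xEB5, 3233)
    assert ciphertext == pow(42, 0xEB5, 3233)
    print(ciphertext, ops)
```

The operation log makes the source of the leakage explicit: the sequence of "S" and "SM" entries is a direct encoding of the exponent bits, which is what a simple analysis of the trace tries to read.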


Figure  20.  RSA  Encryption  System 


PowerPC 440 Processor: The on-board PowerPC in the FPGA handles the simple
I/O functions to transfer the plaintext, modulus, and key inputs via serial
communication with a PC. In addition, the PowerPC handles synchronization of
Riscure's Inspector software with the RSA Encryption System.

Processor Local Bus (PLB): 32-Bit interface between the PowerPC and the IP
cores.

512-Bit RSA Encryption: VHDL description of a 512-Bit RSA Algorithm implemented
on the FPGA.

Serial  Control  Port:  The  serial  I/O  is  used  to  transfer  externally  generated  key, 
modulus,  and  plaintext  to  the  PowerPC  to  start  the  encryption  cycle.  At  the 
end  of  each  encryption  cycle,  serial  I/O  is  used  to  transfer  the  ciphertext  back 
to  a  commercial  SCA  software  package  called  Inspector. 

Block  RAM  (BRAM):  BRAM  is  a  configurable  memory  module  on  the  FPGA 
where  the  PowerPC  instructions  are  stored. 

LCD:  The  LCD  displays  the  number  of  encryption  cycles  that  have  occurred  during 
system  operation. 

RSA was initially developed with two equivalent Montgomery modular multiplier
instantiations, shown in Figure 6, to execute the square and multiply operations for
the circuit. This circuit became Version A as described in Appendix C. The two
modular multipliers were programmed to execute simultaneously for each
loop of the modular exponentiation process in order to reduce execution time.
Once data was collected from the EM side channel of the circuit it was
noticed that this circuit structure caused a very dynamic difference in EM signal
levels. Figure 21 shows the first 500 µs of the 512-bit trace shown in Figure 53 of
Appendix B. This trace represents the first design iteration, which executed the square
and multiply operations at the same time and produced a signal that leaked a great deal
of sensitive information. The SEMA of this trace is also shown in Figure 21 and shows
how the private key can simply be read directly from the trace (right to left). The
first bit in the Right-to-Left Square-and-Multiply Algorithm appears as though it
only conducts a square operation, resulting in a 0-bit, but that is only because the
message is multiplied by a "1" so that the first multiply operation can be skipped.
Additionally, RSA keys are relatively prime to φ(n), which is even, and will therefore
always be odd. This means the LSB is assumed to be "1".


Figure 21. SEMA of Combined Square and Multiply Trace with key (5 8E B5) - RSA
Version A


In the second iteration of Hardware RSA code, the square and multiply operations
were changed to operate separately in time, but still using the same two separate modular
multiplier instantiations. This version became the baseline circuit for all future testing
and was named Version B as described in Appendix C. Figure 22 shows the first
500 µs of the 512-bit trace shown in Figure 54 of Appendix B. This trace represents
the second design iteration, where the square and multiply operations were separated
in time. In this circuit version, a square operation (d_i = 0) and a square followed by
a multiply operation (d_i = 1) can be easily identified since the multiply operations
leak approximately 25 mV more at peak amplitude. The SEMA of this trace
is also shown in Figure 22 and shows how the private key can be easily extracted
from the trace. The apparent leakage of the trace can also be seen in Figure 23 after
some signal processing to accentuate the dips in the signal.

Figure 22. SEMA of Separate Square and Multiply Trace with key (E B5) - RSA
Version B
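
The visual reading of a Version B style trace described above can be mimicked with a simple threshold, sketched here in Python under stated assumptions. The leakage model is synthetic: the 100 mV base level is an arbitrary hypothetical value, only the roughly 25 mV extra peak for a multiply comes from the text, and the names synth_peaks and sema_recover are introduced for illustration. No measured data is used.

```python
BASE_MV = 100.0        # hypothetical peak level of a lone square (arbitrary)
MULT_EXTRA_MV = 25.0   # extra peak amplitude when a multiply follows (from the text)


def synth_peaks(key, nbits):
    """Toy leakage model: one peak amplitude per key bit, LSB first."""
    return [BASE_MV + (MULT_EXTRA_MV if (key >> i) & 1 else 0.0)
            for i in range(nbits)]


def sema_recover(peaks):
    """Classify each peak against the midpoint between lowest and highest peak.

    Assumes the trace contains at least one 0-bit and one 1-bit; otherwise
    the midpoint threshold carries no information.
    """
    threshold = (min(peaks) + max(peaks)) / 2.0
    key = 0
    for i, peak in enumerate(peaks):
        if peak > threshold:       # higher peak: square plus multiply, bit = 1
            key |= 1 << i
    return key


if __name__ == "__main__":
    peaks = synth_peaks(0xEB5, 12)
    print(hex(sema_recover(peaks)))
```

With a clean 25 mV separation a single trace suffices, which is why the later design steps aim to remove that separation rather than merely add noise.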


Figure  23.  Separate  Square  and  Multiply  Trace  after  signal  processing  -  RSA  Version 
B 


The second iteration of Hardware RSA had some inherent timing variations due to
how the multiply operations were performed. The timing of the multiply operations
was based on the number of bits between the first and last 1-bit in the multiplier
input and varied by a few clock cycles between operations. This misalignment in the
timing of a randomized input caused some difficulty in conducting DEMA to extract
the key from the baseline circuit version. Due to the difficulty in conducting DEMA,
an alternate version was created that had constant timing based on a fixed length
counter. Figure 24 shows the first 500 µs of the 512-bit trace shown in Figure 55 of
Appendix B. This trace shows the statistical average of 1,000 resampled and signal
processed traces using a random plaintext input. The statistical average package
removes all the differences between the 1,000 traces and creates a trace containing
the average power level for them all. Since the multiply operations use constant timing,
the SCA software can easily align and extract the statistical differences between the
traces. The next step after characterizing the baseline EM signal was to develop the
polymorphic circuit to create the CLOAK countermeasure.
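
The effect of the statistical average can be sketched in Python as a pointwise mean over aligned traces. This is not the Inspector module; the function name average_traces, the eight-sample common signal, and the Gaussian noise level are hypothetical values chosen only to show that whatever differs between traces averages out while the shared component remains.

```python
import random


def average_traces(traces):
    """Pointwise mean of equally long, already aligned traces."""
    length = len(traces[0])
    if any(len(t) != length for t in traces):
        raise ValueError("all traces must have the same length")
    count = len(traces)
    return [sum(t[i] for t in traces) / count for i in range(length)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical shared signal, standing in for the key-dependent component.
    common = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 4.0, 2.0]
    # 1,000 traces that differ only by per-sample noise (sigma = 1, arbitrary).
    traces = [[s + rng.gauss(0.0, 1.0) for s in common] for _ in range(1000)]
    mean_trace = average_traces(traces)
    worst = max(abs(m - s) for m, s in zip(mean_trace, common))
    print(f"largest deviation from the common signal: {worst:.3f}")
```

The sketch only works because the samples line up index by index, which mirrors the point made in the text: constant timing is what makes the average meaningful.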

Figure 24. Separate Square and Multiply Trace after signal processing using constant
timing - RSA Version 2B


4.4  Polymorphic  Circuit  Development 

The  polymorphic  circuit  development  to  create  the  Polymorphic  CES  was  broken 
down  into  three  steps: 

•  Flatten  Power  Signature 

•  Randomize  Power  Signature 

•  Randomize  Circuit  Timing 

4.4.1  Flatten  Power  Signature. 

The first step in the development of the polymorphic circuit design is to flatten
the EM power signature of the RSA circuit. Even though the square and multiply
operations were separated in time, they still used two separate, but equivalent,
modular multiplier instantiations. This allowed the circuit to still leak valuable
information about the functions being executed. Therefore, the third iteration of hardware
RSA was an attempt to flatten the power signature between the square and multiply
operations. This was accomplished by creating one modular multiplier instantiation
within the circuitry to execute both the square and multiply operations. This circuit
became Version C as described in Appendix C.

Figure 25 shows the first 500 µs of the 512-bit trace shown in Figure 56 of Appendix B.
This trace represents the third design iteration, which uses one modular
multiplier instantiation to execute the square and multiply operations in order to
level the circuit's power consumption. The timing boundaries for the square and
multiply operations of this trace are also shown in Figure 25, which shows how the
private key can no longer easily be extracted from the original trace using SEMA.

Figure 25. SEMA overlaid on Level Power Consumption Trace with key (E B5) - RSA
Version C

However,  after  some  signal  processing  to  accentuate  the  dips  in  the  signal  between 
the  square  and  multiply  operations,  shown  in  Figure  26,  more  information  about  the 
trace  can  be  extrapolated.  Signal  processing  allows  the  trace  to  be  compartmented, 
but  DEMA  needs  to  be  conducted  to  identify  a  square  versus  a  multiply  operation. 
The  timing  boundaries  for  this  trace  can  also  be  seen  in  Figure  26.  SEMA  can  no 
longer  be  conducted  on  the  trace  so  the  square  and  multiply  labels  are  superimposed 
on  the  trace  as  a  courtesy.  As  you  can  see  from  the  trace,  none  of  the  modular 
operations  correlate  with  any  others  in  order  to  identify  a  square  from  a  multiply. 

Figure 26. SEMA overlaid on Level Power Consumption Trace with key (E B5) after
signal processing - RSA Version C

This circuit design provides a compact and low power cryptographic solution capable
of providing some protection against side channel attacks. Because hardware
utilization decreased by 24.6% relative to the previous implementation, the clock speed was
doubled to 40 MHz. What if the adversary is capable of aligning the trace based on the
dips in the power trace? In order to characterize this possibility the circuit design was
modified to have constant timing based on a fixed length counter. Figure 27 shows
the first 500 µs of the 512-bit trace shown in Figure 58 of Appendix B. This trace
shows the statistical average of 1,000 resampled and signal processed traces using a
random plaintext input. Figure 27 shows a trace with a potential 1.5 mV difference
in power levels between square and multiply operations. This trace is only made
possible by using a multiplier with constant timing. In actuality, the adversary would
not be able to modify the circuitry to collect these traces.

Figure 27. Level Power Consumption Trace after signal processing using constant
timing - RSA Version 2C


Instead, the adversary would use a function similar to an Elastic Alignment¹
to attempt to align the traces. Using an elastic alignment in conjunction with a
statistical average on approximately 1,000 random plaintext traces yields the first
500 µs of the 512-bit trace, shown in Figure 28. The full 512-bit trace can be found in
Appendix B, Figure 59. Now that the power signatures are relatively level they can
be randomized.

¹ An Elastic Alignment is a trace alignment module contained in Inspector capable of local
stretching and compression of traces at multiple points based on a reference trace for
synchronization necessary for DPA or DEMA.


Figure 28. Level Power Consumption Trace after elastic alignment and average - RSA
Version C


4.4.2 Randomize Power Signature.

The second step in the development of the polymorphic circuit design is to randomize
the EM power signature. Once the power signature was relatively flattened
between multiply and square operations, the power levels within each modular multiply
operation were then varied in a pseudo-random manner. The fourth and fifth
iterations of the hardware RSA implemented two and three different adders, respectively.
Figure 29(a) shows the polymorphic multiplier with a blue box where the
polymorphic adder in Figure 29(b) is placed in the design. The first adder used the
standard VHDL add operator, the second a ripple carry adder, and the third a carry
look-ahead (CLA) adder. This changed the possible power levels for each addition
operation to three different levels. Therefore, using DEMA would be more difficult
since the adders used in each of the operations have three possible correlation levels
for each possible input, tripling the complexity. These circuits became Version D and
E respectively as described in Appendix C.
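
The idea of a polymorphic adder, several functionally equivalent adders with one selected per operation, can be sketched in Python. This is a behavioral illustration under assumptions, not the VHDL design: the 16-bit toy width, the function names, and the integer select input (which stands in for the pseudo-random selection) are introduced here. In hardware the three structures differ in switching activity; in software they only differ in how the carries are computed.

```python
import random

WIDTH = 16                      # toy operand width (the design uses far wider operands)
MASK = (1 << WIDTH) - 1


def add_builtin(a, b):
    """Counterpart of the standard add operator."""
    return (a + b) & MASK


def add_ripple(a, b):
    """Ripple carry adder: one full adder per bit, carry passed bit to bit."""
    total, carry = 0, 0
    for i in range(WIDTH):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total


def add_cla(a, b):
    """Carry look-ahead form: carries derived from generate/propagate signals.

    Hardware flattens the recurrence c[i+1] = g[i] | (p[i] & c[i]) into
    parallel logic; it is evaluated in a loop here for brevity.
    """
    gen, prop = a & b, a ^ b
    carries, c = 0, 0
    for i in range(WIDTH):
        carries |= c << i
        c = ((gen >> i) & 1) | (((prop >> i) & 1) & c)
    return (prop ^ carries) & MASK


ADDERS = (add_builtin, add_ripple, add_cla)


def polymorphic_add(a, b, select):
    """Pick one of the equivalent adders; select is a stand-in for PRNG bits."""
    return ADDERS[select % len(ADDERS)](a, b)


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = rng.randrange(MASK + 1), rng.randrange(MASK + 1)
    print([polymorphic_add(a, b, s) for s in range(3)])
```

All three selections return the same sum, which is the property the countermeasure relies on: the arithmetic result is unchanged while the physical signature of the addition varies.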


(a)  Polymorphic  Modular  Multiplier  (b)  Polymorphic  Adder 


Figure  29.  Polymorphic  Modular  Multiplier  with  Polymorphic  Adder 


Figure 30 shows the first 500 µs of the 512-bit trace shown in Figure 60 of Appendix
B. This trace represents the fifth design iteration, which uses one modular multiplier
instantiation that contains multiple adders in order to randomize the circuit's power
consumption. The timing boundaries for the square and multiply operations of this
trace are also shown in Figure 30, which shows how the private key can no longer
easily be extracted from the trace using SEMA. The square and multiply labels
are superimposed on the trace as a courtesy to the reader.

Figure 30. SEMA overlaid on Randomized Power Consumption Trace with key (E B5)
- RSA Version E

After signal processing to accentuate the dips in the signal, shown in Figure 31,
more information is revealed about the trace. Signal processing allows the trace to
be compartmented, but DEMA needs to be conducted to identify a square versus a
multiply operation. The timing boundaries for this trace can also be seen in Figure
31, though SEMA can no longer easily be conducted. Each multiply
and square operation has a randomized power signature and each operation is
different from the next, revealing little about the operation being performed. With
a software implementation, the operations would have regular patterns, but in this
hardware implementation the operations are dependent on the inputs to the multiplier
instantiation.

Figure 31. SEMA overlaid on Randomized Power Consumption Trace with key (E B5)
after signal processing - RSA Version E

Using an elastic alignment in conjunction with a statistical average on approximately
1,000 random plaintext traces yields the first 500 µs of the 512-bit trace, shown
in Figure 32. For this design the timing is not randomized enough at the beginning
of the trace, so in the first 500 µs the squares can be differentiated from the multiplies.
The full 512-bit trace can be found in Appendix B, Figure 62. The full trace shows
that after the first 500 µs the dips in the signal are still apparent, but a square cannot
be distinguished from a multiply operation. Naturally, these results lead to the next
logical step, which is randomizing circuit timing.
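
The internals of Inspector's elastic alignment are not described here. As a stand-in for the general idea of locally stretching and compressing a trace against a reference, the following Python sketch uses classic dynamic time warping (DTW); the choice of DTW, the function names dtw_path and elastic_align, and the short example traces are all assumptions made for illustration.

```python
def dtw_path(ref, trace):
    """Return a minimum-cost DTW warping path as (ref_index, trace_index) pairs."""
    n, m = len(ref), len(trace)
    inf = float("inf")
    # cost[i][j]: cheapest cumulative cost of matching ref[:i] with trace[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - trace[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def elastic_align(ref, trace):
    """Warp trace onto the time base of ref by averaging the matched samples."""
    sums = [0.0] * len(ref)
    counts = [0] * len(ref)
    for i, j in dtw_path(ref, trace):
        sums[i] += trace[j]
        counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


if __name__ == "__main__":
    reference = [0, 0, 1, 5, 1, 0, 0, 0]
    delayed = [0, 0, 0, 0, 1, 5, 1, 0]     # same shape, shifted by two samples
    print(elastic_align(reference, delayed))
```

After every trace has been warped onto the reference in this way, the traces share one time base and can be averaged pointwise. The quadratic cost of plain DTW is acceptable for a short sketch but would be a concern for full-length traces.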


Figure  32.  Randomized  Power  Consumption  Trace  after  elastic  alignment  and  average 
-  RSA  Version  E 


4.4.3  Randomize  Circuit  Timing. 

The third step in the development of the polymorphic circuit design is to use a
64-bit LFSR to pseudo-randomly adjust the execution times of each of the multiply
operations. Initially the multiplier execution had a slight randomness since the execution
time ended after the multiplier input shifted past the most significant 1-bit,
but this only offered a couple of clock cycles of variation between each multiply operation.
First, a 64-bit LFSR was created using the feedback polynomial shown in Equation 2.
This PRNG generates approximately $2^n - 1 = 2^{64} - 1 \approx 18.4 \times 10^{18}$ possible different
numbers before repeating. Selected bits of the generated number were then used to
make the execution timing decisions shown in Figure 33.

$x^{64} + x^{63} + x^{61} + x^{60} + 1$  (2)
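
A linear feedback shift register with the taps of Equation 2 can be modeled in Python as follows. This is a sketch under assumptions: the Fibonacci (external XOR) arrangement, the shift direction, and the numbering of stage k as bit k-1 are choices made here, and the actual VHDL may be wired differently. The function names are hypothetical. Because the full 64-bit period cannot be enumerated, the same routine is checked on a 4-bit register with taps (4, 3), whose period is 15.

```python
def lfsr_step(state, width=64, taps=(64, 63, 61, 60)):
    """Advance a Fibonacci LFSR by one clock.

    The feedback bit is the XOR of the tapped stages (stage k = bit k-1);
    the register shifts left and the feedback enters at bit 0.
    """
    feedback = 0
    for tap in taps:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & ((1 << width) - 1)


def lfsr_period(seed, width, taps):
    """Count the clocks until the register returns to a nonzero seed."""
    state = lfsr_step(seed, width, taps)
    clocks = 1
    while state != seed:
        state = lfsr_step(state, width, taps)
        clocks += 1
    return clocks


if __name__ == "__main__":
    print(lfsr_period(1, 4, (4, 3)))    # small register, feasible to enumerate
    print((1 << 64) - 1)                # sequence length claimed for 64 bits
    state = 1
    for _ in range(5):
        state = lfsr_step(state)
        print(f"{state:016x}", state & 0b11)   # two LSBs, as used for a decision
```

The all-zero state must be avoided as a seed, since XOR feedback would keep the register at zero indefinitely.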


Figure 33. Pseudo-Random Timing Flow Chart

The timing flowchart shown in Figure 33 shows that at the beginning of each
multiply or square operation, the first decision (green decision) is executed. This
decision is based on the two LSBs of the LFSR and will execute an increased
pseudo-random timing reduction approximately 75% of the time (i.e., both bits are 0-bits).
Otherwise, the original timing reductions based on the number of bits between the
first and last 1-bit in the multiplier input will be executed. The increased timing
reductions involve the second and third decisions. The second decision (blue
decision) checks if the three LSBs of the multiplier input are zeros. If the three LSBs
are not zeros there is no change in the timing, but if they are zeros then the circuitry
moves to the third decision (red decision). The third decision makes a binary decision
on whether or not to skip a clock cycle based on the LSB of the LFSR and executes
50% of the time. A clock cycle in the multiply operation is skipped by simply shifting
the multiplier right two bits and the multiplicand left two bits instead of the normal
one bit.
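
The cycle-skipping step can be illustrated with a simplified interleaved shift-and-add modular multiplier in Python. This is a sketch under assumptions and not the Montgomery multiplier of the design: the function name modmul_shift_add and the rand_bit callable (a stand-in for the LFSR bit used by the third decision) are hypothetical, and the first (green) decision is deliberately left out. It shows why a double shift is safe when the three LSBs of the multiplier are zero: no addition would have occurred in the skipped cycle, so the product is unchanged while the cycle count varies.

```python
import random


def modmul_shift_add(a, b, n, rand_bit=None):
    """Return ((a * b) mod n, cycles) using shift-and-add multiplication.

    The loop ends once the multiplier has shifted past its most significant
    1-bit. If rand_bit is given and the three LSBs of the multiplier are
    zero, a set random bit merges two shifts into one cycle.
    """
    result = 0
    multiplicand = a % n
    multiplier = b
    cycles = 0
    while multiplier:
        if multiplier & 1:
            result = (result + multiplicand) % n
        if rand_bit is not None and (multiplier & 0b111) == 0 and rand_bit():
            multiplicand = (multiplicand << 2) % n   # skip one clock cycle
            multiplier >>= 2
        else:
            multiplicand = (multiplicand << 1) % n   # normal one-bit step
            multiplier >>= 1
        cycles += 1
    return result, cycles


if __name__ == "__main__":
    rng = random.Random(7)
    a, b, n = 0x1234, 0xA8C0, 65521          # arbitrary toy operands
    for _ in range(4):
        print(modmul_shift_add(a, b, n, rand_bit=lambda: rng.getrandbits(1)))
```

Repeated calls with the same operands return the same product but may report different cycle counts, which is the behavior the timing randomization relies on.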

The final CLOAK modular multiplier is shown in Figure 34 with a red box showing
the circuitry that randomizes the timing of the multiplier operations. Using the
logic shown in the timing flow chart, the execution time of each multiply operation
was randomized over approximately 490 ± 25 clock cycles. Given that the
RSA circuit is running at a clock frequency of 20 MHz (50 ns per cycle), the timing of each multiply
operation is randomized over approximately 24.5 ± 1.25 µs. In the grand scheme of things that does
not seem like much time, but this makes it very difficult to align multiple traces even
if the attacker is able to acquire multiple traces using the same key, modulus, and
plaintext.


Figure 34. CLOAK Modular Multiplier


Figure 35 shows the first 500 µs of the 512-bit trace shown in Figure 63 of Appendix
B. This trace represents the sixth design iteration, which uses one modular multiplier
instantiation that contains multiple adders in order to randomize the circuit's power
consumption as well as circuitry to randomize the multiplier execution time. This
circuit became Version F as described in Appendix C.


Figure  35.  Randomized  Power  and  Timing  Trace  -  RSA  Version  F 


The timing boundaries for the square and multiply operations of this trace are
also randomized in order to reduce the ability of an attacker to align the traces
for DEMA. The timing boundaries for the square and multiply operations cannot
be determined from the trace shown in Figure 35. The timing boundaries for each
trace are slightly different from one another and also differ from previous versions of
RSA. This trace shows how the private key can no longer easily be extracted from
the trace using SEMA, and if several traces are placed side by side there is no way to
effectively align them for DEMA.

After signal processing to accentuate the dips in the signal, shown in Figure 36,
more information is revealed about the trace. Signal processing allows the trace to
become more compartmented, but the dips in this trace are not as
defined as in previous versions. DEMA needs to be conducted to
identify a square versus a multiply operation, but the SCA software has a difficult
time aligning the traces since the multiplier execution time has been randomized at
different points within each multiply cycle. The timing boundaries for this trace
can also be seen in Figure 36, and square and multiply labels are superimposed for
convenience, though SEMA can no longer easily be conducted. Each multiply and
square operation has a randomized power signature as well as
randomized circuit timing. Each operation is different from the next, revealing very
little about the operation being performed.

Figure 36. SEMA overlaid on Randomized Power and Timing Trace with key (E B5)
after signal processing - RSA Version F


Using an elastic alignment in conjunction with a statistical average on approximately
1,000 traces with a random plaintext applied to a fixed key and modulus
yields the first 500 µs of the 512-bit trace, shown in Figure 37. For this design, the
timing is randomized more than before, so that in the first 500 µs the signals cannot
be aligned well enough to differentiate the squares from the multiplies. The full 512-bit trace
can be found in Appendix B, Figure 65. Once the Polymorphic CES development
was completed the results were then analyzed.


Figure 37. Randomized Power and Timing Trace after elastic alignment and average -
RSA Version F


4.5  Protected  RSA  SCA  Results 


Figure 38 shows a comparison of EM signals from three circuit versions for a
typical square and multiply operation after signal processing. An example comparison
of EM signals before signal processing can be seen in Figure 66 in Appendix B. The
three sample square and multiply operations were taken from traces at the same time
period using the same key, modulus, and plaintext so the operations are identical
for each trace. These traces show the differences in the signatures of a characteristic
square and multiply operation. In actuality, no multiply or square operation has the
same signature as any other within the same trace since the multiplier operation is
a function of its inputs. This characteristic is exploited in the polymorphic circuit
design by randomizing power and then randomizing the timing within the multiply
function and therefore the overall encryption cycle.


(a) Baseline Circuit EM signature - RSA Version B


(b) Level Power Circuit EM signature - RSA Version C


(c) Randomized Power Circuit EM signature - RSA Version E


Figure 38. Comparison of EM signals for square and multiply operations after signal
processing

Once these randomized power signatures are coupled with randomized circuit timing,
the complexity of DEMA is compounded. A square is virtually indistinguishable
from a multiply. Figure 39 shows a comparison of EM signals from the Polymorphic
Circuit using randomized power and timing. These traces show the same multiply
and square operation with the same inputs as the traces shown in Figure 38. The
traces show the same square and multiply operations from two different traces using
the same circuitry and identical inputs. However, the two traces are not that similar.
Trace 0 and Trace 1 are shown in Figure 39(a) and 39(b) respectively. Figure 39(c)
shows the two traces overlaid in real time and Figure 39(d) shows the two traces after
they were aligned to have the same start time. The two traces show very
different power signatures and timing even though they are both operating on identical
inputs. With hardware circuitry operating on the same inputs, the traces would
typically have virtually identical signals.


(a) Randomized Power and Timing Trace 0


(b) Randomized Power and Timing Trace 1


(c) Randomized Power and Timing two traces overlaid


(d) Randomized Power and Timing two aligned traces overlaid


Figure 39. Comparison of EM signals for square and multiply operations using identical
inputs - RSA Version F

Figure 40 shows the same square and multiply operations from the protected and
unprotected circuits. With the unprotected circuit in Figure 40(a) the identical inputs
produce two virtually identical signals. However, with the protected circuit in Figure
40(b) the two traces are very different and a square cannot be distinguished from
a multiply operation. This shows visually how different the two traces are; the next
step is to quantify that difference.


(a)  Two  Separate  Square  and  Multiply  Traces  (Baseline)  -  RSA  Version  B 


(b) Two Randomized Power and Timing Traces (CLOAKed) - RSA Version F

Figure  40.  Comparison  of  square  and  multiply  operations  using  identical  inputs 

Figure 41 shows the difference in standard deviation for the same protected and
unprotected square and multiply operations using identical inputs. The trace from
the CLOAKed circuit in Figure 41(b) shows a standard deviation about 23 times
(2300% of) that of the baseline trace shown in Figure 41(a).

(a) Separate Square and Multiply Trace (Baseline) - RSA Version B


(b) Randomized Power and Timing Trace (CLOAKed) - RSA Version F

Figure 41. Comparison of standard deviation of square and multiply operations using
identical inputs


To decrease the noise in the signal, a band-pass filter of 20 MHz ± 1 MHz was applied
to remove signals above and below the hardware operating frequency. Figure 42
shows the difference in standard deviation for the same protected and unprotected
square and multiply operations using identical inputs. The trace from the CLOAKed
circuit in Figure 42(b) now shows a standard deviation about 28 times (2800% of) that
of the baseline trace shown in Figure 42(a). Thus, after a filter was applied to reduce
noise, the difference in standard deviation became even larger.


(a)  Separate  Square  and  Multiply  Trace  (Baseline)  -  RSA  Version  B 


(b)  Randomized  Power  and  Timing  Trace  (CLOAKed)  -  RSA  Version  F 


Figure  42.  Comparison  of  standard  deviation  of  square  and  multiply  operations  after 
frequency  filtering 

Additionally, a standard deviation of the entire trace was calculated across 100
traces using identical inputs. Figure 43 shows the difference in standard deviation
for the first 500 µs of the full protected and unprotected traces. The trace from the
CLOAKed circuit in Figure 43(b) shows a standard deviation about 18 times (1800% of)
that of the baseline trace shown in Figure 43(a).
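
A comparison of this kind can be sketched in Python as a pointwise standard deviation across a set of traces, followed by a ratio of the mean values for the protected and baseline sets. The exact statistic behind the reported factors is not specified here, so the use of the population standard deviation, the averaging over sample positions, and the names pointwise_std and std_ratio are assumptions. The example values are made up solely to exercise the functions.

```python
import statistics


def pointwise_std(traces):
    """Population standard deviation at each sample position across traces."""
    return [statistics.pstdev(column) for column in zip(*traces)]


def std_ratio(protected, baseline):
    """Mean pointwise standard deviation of protected relative to baseline."""
    base = statistics.mean(pointwise_std(baseline))
    if base == 0:
        raise ValueError("baseline traces are identical; ratio is undefined")
    return statistics.mean(pointwise_std(protected)) / base


if __name__ == "__main__":
    # Hypothetical values: the baseline traces nearly coincide, the protected differ.
    baseline = [[0.0, 1.0, 0.0], [0.2, 1.2, 0.2]]
    protected = [[0.0, 1.0, 0.0], [2.0, -1.0, 2.0]]
    print(pointwise_std(baseline))
    print(std_ratio(protected, baseline))
```

A ratio near 1 would indicate that repeated runs on identical inputs are as repeatable as the baseline; a large ratio indicates that identical inputs no longer produce matching traces.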

(a)  Separate  Square  and  Multiply  Trace  (Baseline)  -  RSA  Version  B 


(b) Randomized Power and Timing Trace (CLOAKed) - RSA Version F


Figure  43.  Comparison  of  standard  deviation  of  100  traces  using  identical  inputs 

4.6  Virtex-6  FPGA  Investigations 

The Polymorphic CES was designed, implemented, and tested thoroughly
on the Virtex-5 FPGA, but due to size limitations the hardware implementations
of both AES and RSA were not able to coexist on the same chip. The CES was
then adapted to run on the Virtex-6 FPGA, which is built on newer 40 nm process
technology. This modification was conducted not only to characterize the difference
in side channel signatures, but also to create a fully functional Encryption System
testbed. The system was initially implemented with the baseline RSA circuit,
Version B, as described in Appendix C. The system also contained hardware and
software implementations of AES. Figure 51 in Appendix A shows a flowchart of how
the Encryption System works on the Virtex-6 FPGA. Even with hardware versions
of AES and 512-bit RSA both implemented on the FPGA, the Virtex-6 was only 24%
utilized.

The Virtex-6 FPGA has a Flip-Chip design similar to that of the Virtex-5 FPGA, aside
from being built using smaller process technology. In addition, the Virtex-6 FPGA
comes equipped with a cooling fan and heat sink, shown in Figure 44, installed on
the copper heatspreader. The presence of this fan assembly on top of the FPGA
limits the ability to conduct a non-invasive attack on the circuit from the top. It was also
discovered that this fan, while powered up, adds an average of 15 mV of noise to the
already reduced signal strength, so for initial data collection the fan was turned off.


Figure  44.  Fan  Assembly  on  Virtex-6  FPGA 


Figure 45 shows the EM probe on the bottom of the Virtex-6 FPGA, where all data
collection occurred. Although the Willtek probe is shown in the picture, the Riscure
probe was also used for data collection from the bottom of the Virtex-6 FPGA.


Figure  45.  EM  Inductive  Probe  on  Virtex-6  FPGA 
4.6.1 Unprotected RSA SCA Results.


Initially the Encryption System was implemented on the Virtex-6 FPGA so that
the hardware RSA operated at the same 25 MHz frequency as the MicroBlaze and the
hardware AES. The traces in Figure 46 compare the first 500 µs of the full 512-bit
traces shown in Figures 67 and 68 in Appendix B. These traces represent the
baseline RSA Version B circuit, where the square and multiply operations operate
separately in time, but use two separate modular multiplier instantiations. They
show the difference in signals when other circuitry is running on the system at the
same clock frequency and cannot be filtered out of the trace. Due to the added noise
generated by the MicroBlaze running at the same frequency as the RSA hardware, the
square and multiply operations in Figure 46(a) are not differentiable with the naked
eye. With the Willtek probe the EM signal strength was considerably lower than with
the Riscure probe, but the wider resolution of the Willtek probe made it possible to
filter out the noise and focus more on the FPGA operations.
(a) Separate Square and Multiply Trace using the same clock - RSA Version B

(b) Separate Square and Multiply Trace using a different clock - RSA Version 4B

Figure  46.  Comparison  of  EM  signals  for  Separate  Square  and  Multiply  Trace  after 
signal  processing  using  fixed  plaintext  on  a  Virtex-6  FPGA 


Averaging approximately 100 random plaintext traces yields the trace whose first
500 µs are shown in Figure 47. The full 512-bit traces can be found in Appendix B,
Figures 69 and 70. The full trace shows that after the first 500 µs the square and
multiply operations are still apparent on the Virtex-6 signals, though the dips in the
signal are less pronounced.
(a) Separate Square and Multiply Trace using the same clock - RSA Version B

(b) Separate Square and Multiply Trace using a different clock - RSA Version 4B


Figure  47.  Comparison  of  EM  signals  for  Separate  Square  and  Multiply  Trace  after 
signal  processing  using  random  plaintext  on  a  Virtex-6  FPGA 
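The statistical average used for Figure 47 can be illustrated with a point-wise mean over aligned traces. The sketch below is a minimal Python illustration on synthetic data. The function name, the synthetic signal, and the noise level are assumptions, and the traces are assumed to be already aligned in time; it does not reproduce the signal processing chain used in this research.

```python
import random
import statistics


def average_traces(traces):
    """Point-wise mean of equally long traces that are already aligned in time."""
    return [statistics.mean(samples) for samples in zip(*traces)]


# Synthetic example (hypothetical values in mV): a repeating pattern plus noise.
rng = random.Random(1)
noise_sigma = 5.0
signal = [30.0 if (i // 25) % 3 else 10.0 for i in range(300)]
traces = [[v + rng.gauss(0.0, noise_sigma) for v in signal] for _ in range(100)]

averaged = average_traces(traces)

# Residual noise left after averaging, measured against the known pattern.
residual = statistics.pstdev([a - s for a, s in zip(averaged, signal)])
print(f"noise before: {noise_sigma:.2f} mV, after averaging: {residual:.2f} mV")
```

Averaging suppresses uncorrelated noise only when the operations line up from trace to trace, which is why randomized timing is relevant to this kind of analysis.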


4.6.2  Protected  RSA  SCA  Results. 

The initial design for the Polymorphic CES on the Virtex-6 FPGA was implemented
so that the hardware RSA operated at the same 25 MHz frequency as the MicroBlaze
and the hardware AES. The traces in Figure 48 compare the first 500 µs of the full
512-bit traces shown in Figures 71 and 72 in Appendix B. These traces represent the
RSA Version F circuit, where the CLOAK countermeasure randomizes power and timing
for the RSA circuit. They show the difference in signals when other circuitry is
running on the system at the same clock frequency and cannot be filtered out of the
trace. Even after the noise generated by the MicroBlaze hardware was removed from
the signal, as shown in Figure 48(b), the square and multiply operations are not
differentiable with the naked eye and therefore SEMA cannot be conducted.
(a) Randomized Power and Timing Trace using the same clock - RSA Version F

(b) Randomized Power and Timing Trace using a different clock - RSA Version 4F


Figure  48.  Comparison  of  EM  signals  for  Randomized  Power  and  Timing  Trace  after 
signal  processing  using  fixed  plaintext  on  a  Virtex-6  FPGA 

Averaging approximately 100 random plaintext traces yields the trace whose first
500 µs are shown in Figure 49. The full 512-bit traces can be found in Appendix B,
Figures 73 and 74. The full trace shows that after the first 500 µs the dips in the
signal are still apparent, but a square cannot be distinguished from a multiply
operation.

The Virtex-6 FPGA had an overall lower power signal than the Virtex-5 FPGA.
This lower power signal made it more difficult to process and differentiate the traces,
but the results were the same: the square and multiply operations are not
differentiable. Therefore, the Polymorphic CES successfully obfuscates the EM side
channel on both the Virtex-5 and Virtex-6 FPGAs.
(a) Randomized Power and Timing Trace using the same clock - RSA Version F

(b) Randomized Power and Timing Trace using a different clock - RSA Version 4F


Figure  49.  Comparison  of  EM  signals  for  Randomized  Power  and  Timing  Trace  after 
signal  processing  using  random  plaintext  on  a  Virtex-6  FPGA 

4.7  Design  Comparison 

Table 2 shows a comparison of design execution time and size for each of the primary
versions of RSA used on the Virtex-5 FPGA in this research. Version descriptions
are detailed in Appendix C. The separate square and multiply circuit (Version B) is
used as the baseline circuit for all design comparisons; separating the operations
increased execution time by 51% over the original (Version A). The level power
consumption circuit (Version C) performed very well, providing a side channel
signature that cannot be easily attacked while still providing a 24.6% decrease in size
and a 50% decrease in execution time. After the CLOAK countermeasure was
implemented, the circuit execution time decreased by 3.7%, while size increased by
11.1%, flip flops by 10.1%, and LUTs by 34.3%. Table 3 shows how the designs in this
research compare to similar hardware designs. When the designs are compared on
equal ground, the only comparable design is that of Blum in [21], where the execution
times were only estimated and were not validated on an FPGA with a fully functional
implementation of RSA.
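The relative changes quoted above can be recomputed from the Table 2 entries for Versions A, B, and F. The short Python sketch below does this; the function and variable names are illustrative assumptions, and the inputs are the rounded values printed in the table.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent (positive means increase)."""
    return (new - old) / old * 100.0


# Rounded values as printed in Table 2 (512-bit key, 20 MHz clock).
version_a_time_ms = 12.92
version_b = {"time_ms": 19.5, "utilization": 81, "flip_flops": 10873, "luts": 21014}
version_f = {"time_ms": 18.8, "utilization": 90, "flip_flops": 11973, "luts": 28214}

# Cost of separating square and multiply in time (Version A to Version B).
separation_cost = percent_change(version_a_time_ms, version_b["time_ms"])

# Overhead of the CLOAK countermeasure relative to the baseline (Version B to F).
cloak_overhead = {
    key: percent_change(version_b[key], version_f[key]) for key in version_b
}

print(f"separation: {separation_cost:+.1f}% execution time")
for key, value in cloak_overhead.items():
    print(f"CLOAK {key}: {value:+.1f}%")
```

With these rounded inputs the utilization, flip flop, and LUT increases match the quoted 11.1%, 10.1%, and 34.3%. The execution time decrease evaluates to roughly 3.6%, marginally below the quoted 3.7%, which may simply reflect rounding of the tabulated times.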

Table 4 shows a comparison of execution time and size for the primary versions of
RSA used on the Virtex-6 FPGA. Version descriptions are also included in Appendix
C. The table shows that the Virtex-6 FPGA is capable not only of implementing larger
designs, but also of running them at higher frequencies.

Table 2. RSA Design Execution Time and Size for Virtex-5 FPGA

Version                    Execution Time¹       Utilization   Flip Flops   LUTs
Combined Sq & Mult    A    12.92 ms / 128 µs     80%           10871        21008
Separate Sq & Mult    B    19.5 ms / 180 µs      81%           10873        21014
Level Power           C    9.75 ms / 180 µs²     65%           9336         14299
Randomized Power 1    D    19.5 ms / 180 µs      70%           10365        17470
Randomized Power 2    E    19.5 ms / 180 µs      85%           11395        24619
CLOAKed               F    18.8 ms / 174 µs      90%           11973        28214

¹ All circuits were tested with a 20 MHz clock rate. The first execution time uses a normal
  512-bit key size and the second uses a short key of 15h.

² RSA Version C was capable of an increased clock rate of 40 MHz, reducing the execution
  time by half (19.5 ms @ 20 MHz).


Table 3. Comparison of RSA Design Execution Time

Design                  Execution Time¹         Clock Speed   FPGA Type
Hardware w/o CLOAK      19.5 ms / 180 µs        20 MHz        Virtex-5
Hardware w/ CLOAK       18.8 ms / 174 µs        20 MHz        Virtex-5
Hardware w/o CLOAK      11.7 ms / 108 µs        33.3 MHz      Virtex-6
Hardware w/ CLOAK       11.3 ms / 104.3 µs      33.3 MHz      Virtex-6
SASEBO-R [47]           1,689 ms / 16.5 ms²     2 MHz         Virtex-II Pro
SASEBO-R [49]           138 ms / 1.35 ms³       24 MHz        Virtex-II Pro
Blum High Radix [21]    2.93 ms / 110 µs⁴       48 MHz        Virtex-5
Blum [21]               9.38 ms / 350 µs⁴       48 MHz        Virtex-5

¹ Execution time given for normal 512-bit key size and small key size.

² Total running time is not given for SASEBO-R, so running time is estimated based on a
  modular multiplication execution time of 2.2 ms × 1.5n (n = 512 bits / n = 5 bits).

³ Total running time is not given for SASEBO-R, so running time is estimated based on a
  modular multiplication execution time of ~180 µs × 1.5n (n = 512 bits / n = 5 bits).

⁴ RSA implementation execution times are estimated and were not validated on an
  actual FPGA.
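The SASEBO-R estimates in Table 3 follow from the expression given in its footnotes: one modular multiplication time multiplied by 1.5n for an n-bit key. The Python sketch below reproduces that arithmetic. The function name is an assumption, and reading the factor 1.5n as n squarings plus, on average, n/2 multiplications is a common interpretation assumed here rather than one stated in the footnotes.

```python
def estimated_exponentiation_time(mult_time_s, key_bits):
    """Estimate modular exponentiation time as 1.5 * n modular multiplications.

    Assumed reading of the 1.5n factor: n squarings plus, on average,
    n/2 multiplications for an n-bit exponent.
    """
    return 1.5 * key_bits * mult_time_s


# Modular multiplication times taken from the Table 3 footnotes.
sasebo_2mhz = 2.2e-3    # seconds per modular multiplication at 2 MHz
sasebo_24mhz = 180e-6   # seconds (approximate) at 24 MHz

for label, t_mult in (("2 MHz", sasebo_2mhz), ("24 MHz", sasebo_24mhz)):
    full = estimated_exponentiation_time(t_mult, 512)
    short = estimated_exponentiation_time(t_mult, 5)
    print(f"SASEBO-R @ {label}: {full * 1e3:.1f} ms (512-bit), {short * 1e3:.2f} ms (5-bit)")
```

The results agree with the tabulated 1,689 ms / 16.5 ms and 138 ms / 1.35 ms after rounding.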
Table 4. RSA Design Execution Time and Size for Virtex-6 FPGA

Version                    Execution Time¹        Utilization²   Flip Flops   LUTs
Separate Sq & Mult    B    11.7 ms / 108 µs       24%            10876        19307
Level Power           C    11.7 ms / 108 µs       20%            9854         14117
CLOAKed               F    11.3 ms / 104.3 µs     28%            11978        26384

¹ All circuit times are given at 33.3 MHz. Initial tests used a 25 MHz clock rate, but all
  circuits were capable of 33.3 MHz. The first execution time uses a normal 512-bit key size
  and the second uses a short key of 15h.

² Device utilization shows the number of occupied slices in the entire system, including
  the AES hardware.


4.8  Results  Summary 

The initial RSA design implementation (Version A) executed square and multiply
operations at the same time, but this caused a dynamic contrast of 50 mV between
a 1-bit and a 0-bit in the key. This extreme circuit leakage is not acceptable in
encryption/decryption circuitry. The baseline RSA circuit (Version B) executed the
square and multiply operations separately using equal, but separate, multiplier
instantiations. The multiply operation in this baseline circuit still used 25 mV more
peak power than the square operation. The polymorphic circuit design was then
implemented in three basic steps.
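To make concrete why a distinguishable multiply operation leaks the key, the Python sketch below records the operation sequence of a textbook left-to-right square-and-multiply exponentiation and then reads the exponent back from that sequence alone. The scan direction, the function names, and the small operands are assumptions chosen for illustration; the exponent 15h is the short key used in the tables.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary modular exponentiation that logs each operation.

    Returns the result and a string of 'S' (square) and 'M' (multiply).
    """
    result = 1
    operations = []
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus
        operations.append("S")
        if bit == "1":
            result = (result * base) % modulus
            operations.append("M")
    return result, "".join(operations)


def recover_exponent(operations):
    """Rebuild the exponent from the S/M sequence, as a SEMA observer could."""
    bits = []
    for op in operations:
        if op == "S":
            bits.append("0")   # each square starts a new key bit
        else:
            bits[-1] = "1"     # a multiply marks the current bit as 1
    return int("".join(bits), 2)


value, ops = square_and_multiply(7, 0x15, 3233)
print(ops)                          # SMSSMSSM
print(hex(recover_exponent(ops)))   # 0x15
```

If an observer can tell a square from a multiply in the side channel, the sequence alone reveals every key bit, which is the leakage the three steps are meant to remove.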

The first step in polymorphic circuit design development was to flatten the power
signature (Version C). This step used one hardware multiplier instantiation to
conduct both multiply and square operations. In this circuit design none of the
modular operations correlate with any others in a way that identifies a square from
a multiply, at least not using SEMA. The second step took this idea further by
randomizing the power signature (Versions D and E). This step exploited the fact
that Montgomery modular multipliers use a series of add and subtract operations to
conduct modular multiplication. With three adders implemented within the single
multiplier instantiation, the power levels within each modular multiply operation
vary among the three adders, complicating the DEMA process by adding three
additional power levels for each addition operation. The third and final step was to
randomize the circuit timing (Version F) to eliminate the ability to align the square
and multiply operations for DEMA. In this design the square and multiply operations
could not be aligned in order to differentiate a square from a multiply operation,
successfully creating a polymorphic circuit design. Even if the attacker is able to
acquire multiple traces using a fixed key, modulus, and plaintext, each trace will
have a different side channel signature.
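The second and third steps can be illustrated in software with a bit-serial Montgomery multiplication that picks one of three functionally identical adders at random for each step and also draws a random number of idle cycles. This Python sketch is a behavioral illustration only, under stated assumptions: it is not the VHDL design, the radix-2 algorithm and all names are assumed, the generate/propagate adder merely stands in for a carry look-ahead adder, and the idle cycles are only logged rather than timed.

```python
import random


def add_builtin(x, y):
    """Stand-in for the synthesized '+' operator."""
    return x + y


def add_ripple_carry(x, y):
    """Bit-by-bit ripple-carry addition of two non-negative integers."""
    result, carry, shift = 0, 0, 0
    while x or y or carry:
        a, b = x & 1, y & 1
        result |= (a ^ b ^ carry) << shift
        carry = (a & b) | (carry & (a ^ b))
        x >>= 1
        y >>= 1
        shift += 1
    return result


def add_generate_propagate(x, y):
    """Addition from generate/propagate terms (stand-in for carry look-ahead)."""
    while y:
        generate = x & y
        x ^= y                 # sum bits without the carries
        y = generate << 1      # carries move one position left
    return x


ADDERS = (add_builtin, add_ripple_carry, add_generate_propagate)


def cloaked_montgomery_multiply(a, b, modulus, n_bits, rng, max_idle=3):
    """Radix-2 Montgomery product a*b*2^(-n_bits) mod modulus.

    Requires an odd modulus below 2**n_bits and a, b below modulus. The adder
    and the idle-cycle count are chosen at random per step and logged.
    """
    s = 0
    schedule = []
    for i in range(n_bits):
        adder = rng.choice(ADDERS)
        schedule.append((adder.__name__, rng.randint(0, max_idle)))
        if (a >> i) & 1:
            s = adder(s, b)
        if s & 1:              # make s even so the halving below is exact
            s = adder(s, modulus)
        s >>= 1
    if s >= modulus:
        s -= modulus
    return s, schedule


first, schedule_1 = cloaked_montgomery_multiply(1234, 2345, 3233, 12, random.Random(1))
second, schedule_2 = cloaked_montgomery_multiply(1234, 2345, 3233, 12, random.Random(2))
print(first == second, schedule_1 == schedule_2)
```

Both calls return the same product while following different adder and idle-cycle schedules, mirroring the statement that identical inputs produce different side channel signatures.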

After the Polymorphic CES design was tested thoroughly on the Virtex-5 FPGA,
the CES was adapted to run on the Virtex-6 FPGA. This modification was made not
only to characterize the difference in side channel signatures, but also to create a
fully functional Encryption System testbed. The Virtex-6 FPGA implementations
had a lower overall signal strength due to the larger chip size and lower power levels,
but the CLOAK countermeasure was still able to obfuscate the EM side channel.
V. Conclusion


This research effort has determined that a polymorphic circuit design that varies
circuit power consumption and timing can protect a cryptographic device from EMA
attacks. This chapter summarizes the research effort by identifying the objectives
that were met, presenting the conclusions that were drawn, discussing contributions
to the field of study, and establishing the foundations for future work.

5.1  Completed  Objectives 

• Polymorphic Circuit Design: A Polymorphic Montgomery Modular Multiplier
was developed and implemented as a CLOAK countermeasure within the modular
exponentiation circuitry of a hardware implementation of the RSA cryptographic
algorithm. The multiplier is capable of changing the way it functions in both
power consumption level and circuit timing.

• Implement RSA with Reconfiguration: The RSA encryption/decryption algorithm
was implemented on the Virtex-5 and Virtex-6 FPGAs using VHDL. The
FPGA based Encryption System designs were built as testbed systems on
their respective platforms to analyze RSA side channel signatures with and
without the polymorphic circuit design implemented.

• Analyze Side Channel Signatures: The EM side channel signatures of the RSA
algorithm were analyzed before and after implementation of the CLOAK
countermeasure in order to characterize the circuit's ability to resist timing attacks.
The effort culminated in a proof of concept polymorphic circuit design that enhances
the system's protective countermeasures by obfuscating circuit operations from SCA.




5.2  Conclusions 


This research has determined that a polymorphic circuit design that varies circuit
power consumption and timing can protect a cryptographic device from EMA attacks.
This goal was achieved by varying the EM signature of the RSA cryptographic
algorithm in such a way that the observer/adversary would be unable to correlate
the side channel signature to the specific cryptographic functions being executed.
Specifically, this research designed and implemented a 512-bit hardware version of
RSA using modular exponentiation on the Virtex-5 and Virtex-6 FPGAs using VHDL.
This cryptographic circuit is capable of obfuscating the vulnerable square and
multiply operations within the algorithm in order to effectively CLOAK their function
from side channel attack. As shown in Figure 40(b), two traces taken with identical
inputs on the CLOAKed circuit give two completely different side channel signatures
for the given square and multiply operations. The standard deviation of the square
and multiply operation for the CLOAKed circuit was shown to be 23 times (2300% of)
that of the baseline trace. Additionally, the standard deviation of 100 full traces of
the CLOAKed circuit was shown to be 18 times (1800% of) that of 100 baseline traces.
Hardware obfuscation is very difficult to accomplish, and a more secure design can
carry a considerable size and performance penalty. For this research there was an
increase in execution time of 51% when the square and multiply operations were
separated in time, but this measure was pivotal in creating a baseline circuit to
obfuscate circuit functions. Once the CLOAK countermeasure was implemented, the
execution time decreased by 3.7% and size increased by 11.1%. In the end, the
hardware functions of the cryptographic algorithm were successfully obfuscated.
5.3 Contributions


• Developed a 512-bit hardware implementation of the RSA public-key
encryption/decryption algorithm using modular exponentiation to be used on the
Virtex-5 or Virtex-6 FPGAs.

• Incorporated a hardware implementation of the AES symmetric-key encryption
algorithm on the Virtex-5 and Virtex-6 FPGAs.

• Created an FPGA based Encryption System testbed, named Polymorphic CES,
to be used for side channel analysis on hardware and software based cryptographic
algorithms. The baseline system incorporates a hardware implementation
of AES and RSA as well as a software implementation of AES. On the Virtex-5
Encryption System version the hardware instantiations of RSA and AES are
implemented separately. The Virtex-6 Encryption System version combines
both hardware instantiations of RSA and AES into one fully functional design
capable of running multiple versions of each.

•  Designed,  implemented,  and  tested  the  CLOAK  countermeasure. 

•  Conducted  EM  side  channel  analysis  of  hardware  based  RSA  before  and  after 
implementation  of  the  CLOAK  countermeasure. 

5.4  Future  Work 

•  Randomizing  the  order  of  execution  for  square  and  multiply  operations  within 
each  step  of  the  modular  exponentiation  process. 

• Development of a random RSA key generation module using Java within
Inspector or C code on the PowerPC, to be incorporated in the trace
acquisition process of our FPGA based Encryption System testbed.

• Development of a custom DPA/DEMA module within Inspector or Matlab capable
of automating the key extraction process for a hardware based RSA trace.

•  Addition  of  circuitry  for  blinding  the  exponent. 

•  Addition  of  noise  maker  circuitry,  such  as  oscillators,  to  further  obfuscate  the 
EM  signal  from  the  adversary.  The  MicroBlaze  served  this  function  in  the 
Virtex-6  investigations.  Note:  Oscillators  need  to  operate  at  the  same  frequency 
as  the  RSA  circuit  otherwise  they  can  be  easily  filtered  out  of  the  trace  during 
signal  processing. 

• The latest research in Digital Fingerprinting of hardware allows for the
generation of hardware specific keys linked to specific pieces of hardware. These
keys can be used to provide a private key for decryption operations or digital
signatures in algorithms such as AES or RSA. Using such a hardware specific
key for RSA would require a little more signal processing due to the RSA key
generation requirements. This key could also be used to drive a polymorphic
function that would essentially disable the circuit functionality if attempts were
made to copy the system bitstream to another FPGA.
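Exponent blinding, listed above as future work, is not implemented in this research. As a hedged illustration of the general idea only, the Python sketch below adds a random multiple of φ(n) to the private exponent before each exponentiation, so the bit pattern processed by the square and multiply operations changes from run to run while the result does not. The toy key, the function name, and the blinding width are assumptions for illustration and are far too small for real use.

```python
import random


def blinded_exponent(private_exponent, phi_n, rng, blind_bits=16):
    """Return d + r * phi(n) for a fresh random r (textbook exponent blinding)."""
    r = rng.getrandbits(blind_bits)
    return private_exponent + r * phi_n


# Toy RSA parameters (hypothetical, illustration only).
p, q = 61, 53
n = p * q                      # 3233
phi_n = (p - 1) * (q - 1)      # 3120
e, d = 17, 2753                # e * d = 46801 = 15 * 3120 + 1

ciphertext = pow(65, e, n)
rng = random.Random()
for _ in range(3):
    d_blind = blinded_exponent(d, phi_n, rng)
    # Different exponent bits on every run, same decrypted value.
    print(bin(d_blind).count("1"), pow(ciphertext, d_blind, n))
```

Because the exponent bits differ on each execution, traces collected for a fixed key no longer share a common square and multiply sequence; this would complement, not replace, the randomization of power and timing.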

5.5  Summary 

This research provides contributions in the information protection and tamper
protection of MCTs. The documented results and analysis of the CLOAK
countermeasure confirm that it successfully obfuscates the EM side channel of the
hardware based RSA circuitry on the Virtex-5 as well as the Virtex-6 FPGA. Also, the
development of the experimental testbed unit provides the functionality to test
encryption system vulnerabilities and countermeasures on an FPGA based system
while having complete access to the source code.
Appendix A. Encryption System Flowcharts

Figure 50. RSA Encryption System Flowchart on the Virtex-5 FPGA

Figure 51. Encryption System Flowchart on the Virtex-6 FPGA

Appendix B. RSA Traces


2.1  Virtex-5  FPGA  Traces 


Figure  53.  Full  512-bit  Combined  Square  and  Multiply  Trace  -  RSA  Version  A 


Figure  54.  Full  512-bit  Separate  Square  and  Multiply  Trace  -  RSA  Version  B 


Figure  55.  Full  512-bit  Separate  Square  and  Multiply  Trace  using  constant  timing  - 
RSA  Version  2B 


Figure  56.  Full  512-bit  Level  Power  Consumption  Trace  -  RSA  Version  C 


Figure  57.  Full  512-bit  Level  Power  Consumption  Trace  after  signal  processing  -  RSA 
Version  C 


Figure  58.  Full  512-bit  Level  Power  Consumption  using  constant  timing  -  RSA  Version 
2C 
Figure 59. Full 512-bit Level Power Consumption Trace after elastic alignment and
average - RSA Version C


Figure 60. Full 512-bit Randomized Power Consumption Trace - RSA Version E


Figure 61. Full 512-bit Randomized Power Consumption Trace after signal processing
- RSA Version E


Figure 62. Full 512-bit Randomized Power Consumption Trace after elastic alignment
and average - RSA Version E


Figure 63. Full 512-bit Randomized Power and Timing Trace - RSA Version F


Figure 64. Full 512-bit Randomized Power and Timing Trace after signal processing -
RSA Version F


Figure 65. Full 512-bit Randomized Power and Timing Trace after elastic alignment
and average - RSA Version F


(a) Baseline Circuit EM signature - RSA Version B

(b) Level Power Circuit EM signature - RSA Version C

(c) Randomized Power Circuit EM signature - RSA Version E

Figure 66. Comparison of EM signals for square and multiply operations

2.2 Virtex-6 FPGA Traces


Figure  67.  Full  512-bit  Separate  Square  and  Multiply  Trace  after  signal  processing 
using  fixed  plaintext  -  RSA  Version  B  on  a  Virtex-6  FPGA 

Figure 68. Full 512-bit Separate Square and Multiply Trace after signal processing
using a different clock and fixed plaintext - RSA Version 4B on a Virtex-6 FPGA


Figure 69. Full 512-bit Separate Square and Multiply Trace after signal processing
using random plaintext - RSA Version B on a Virtex-6 FPGA


Figure 70. Full 512-bit Separate Square and Multiply Trace after signal processing
using a different clock and random plaintext - RSA Version 4B on a Virtex-6 FPGA


Figure 71. Full 512-bit Separate Square and Multiply Trace after signal processing
using fixed plaintext - RSA Version F on a Virtex-6 FPGA


Figure 72. Full 512-bit Separate Square and Multiply Trace after signal processing
using a different clock and fixed plaintext - RSA Version 4F on a Virtex-6 FPGA


Figure 73. Full 512-bit Separate Square and Multiply Trace after signal processing
using random plaintext - RSA Version F on a Virtex-6 FPGA


Figure 74. Full 512-bit Separate Square and Multiply Trace after signal processing
using a different clock and random plaintext - RSA Version 4F on a Virtex-6 FPGA

Appendix C. Version Control


AES Version A (aes_hw_v1_00_a): Working version of the non-iterative 128-bit
Hardware AES code.

AES Version B (aes_hw_v1_00_b): Working version of the iterative 128-bit Hardware
AES code.

RSA Version A (rsa_hw_v1_00_a): First working version of the 512-bit Hardware RSA
code. Square and Multiply are executed at the same time.

RSA Version B (rsa_hw_v1_00_b): Second iteration of the 512-bit Hardware RSA
code. Square and Multiply are executed at different times, using two
different instantiations for the square and multiply operations.

RSA Version 2B (rsa_hw_v2_00_b): Same as the previous except the multiply
operations all have the same execution time.

RSA  Version  4B  (rsa_hw_v4_00_b):  Same  as  Version  B  except  it  uses  an  external 
clock  input  so  the  MicroBlaze  on  the  Virtex-6  can  run  at  a  different  clock 
frequency. 

RSA Version C (rsa_hw_v1_00_c): Third iteration of the 512-bit Hardware RSA
code. Square and Multiply are executed at different times, using only
one instantiation of the multiplier for the square and multiply operations. This
is the baseline starting point for implementing the CLOAK countermeasure.

RSA  Version  2C  (rsa_hw_v2_00_c):  Same  as  the  previous  except  the  multiply 
operations  all  have  the  same  execution  time. 

RSA Version 3C (rsa_hw_v3_00_c): Same as Version C except it runs 1024-bit
Hardware RSA code on the Virtex-6.

RSA Version 4C (rsa_hw_v4_00_c): Same as Version C except it uses an external
clock input so the MicroBlaze on the Virtex-6 can run at a different clock
frequency.

RSA Version D (rsa_hw_v1_00_d): Fourth iteration of the 512-bit Hardware RSA
code. Square and Multiply are executed at different times, using only
one instantiation of the multiplier for the square and multiply operations. Each
multiplier implements two adder circuits (one being the VHDL add operator
and the second being a ripple carry adder). This is the first implementation of
the CLOAK countermeasure.

RSA Version E (rsa_hw_v1_00_e): Fifth iteration of the 512-bit Hardware RSA
code. Square and Multiply are executed at different times, using only
one instantiation of the multiplier for the square and multiply operations. Each
multiplier implements three adder circuits (one being the VHDL add operator,
the second being a ripple carry adder, and the third being a carry look-ahead adder).
This is the first implementation of the CLOAK countermeasure.

RSA Version 2E (rsa_hw_v2_00_e): Same as the previous except the multiply
operations all have the same execution time.

RSA Version F (rsa_hw_v1_00_f): Sixth iteration of the 512-bit Hardware RSA
code. Square and Multiply are executed at different times, using only
one instantiation of the multiplier for the square and multiply operations. Each
multiplier implements three adder circuits (one being the VHDL add operator,
the second being a ripple carry adder, and the third being a carry look-ahead adder).
Additionally, this implementation randomizes the timing of each multiply
operation to a much greater extent than previous versions. Previous versions
varied execution based on the distance between the first and last 1 bit within
the multiplier. This is the second and final implementation of the CLOAK
countermeasure.

RSA  Version  4F  (rsa_hw_v4_00_f):  Same  as  Version  F  except  it  uses  an  external 
clock  input  so  the  MicroBlaze  on  the  Virtex-6  can  run  at  a  different  clock 
frequency. 
Appendix D. Data Sheets

This  section  contains  selected  data  sheet  information  for  primary  data  collection 
devices. 

4.1  Willtek  1207  Inductive  Probe 

The Willtek 1207 Inductive Probe [17] is an active probe for contactless
measurements. The probe is designed to the following specifications:

•  Frequency  Range  -  50  MHz  to  4  GHz  Specified,  10  MHz  to  6  GHz  Usable 

•  Gain  -  20  dB  @  1  GHz 

4.2  Inspector  EM  Probe 

The Inspector EM Probe [12] is an active probe for contactless measurements. The
device also contains a hardware component, called an EM Shield or Field Deflector,
designed to reduce unwanted EM signals. The probe is designed to the following
specifications:

• Sensitivity @ 1 MHz - 20 MHz/1 µT

• Bandwidth - 1 GHz

• Resolution - 1 mm²

4.3  Lecroy  WavePro  725Zi  Oscilloscope 

The  Lecroy  WavePro  725Zi  Digital  Storage  Oscilloscope  (DSO)  [13]  is  a  2.5  GHz, 
20  GS/s,  4  Ch,  10  Mpts/Ch  DSO  with  a  15.4”  WXGA  Color  Display. 

4.4  Agilent  E3631A  DC  Power  Supply 


The  Agilent  E3631A  [10]  is  a  triple  output  DC  Power  Supply. 


4.5  Virtex-5  FPGA 


The  Virtex-5  FXT  FPGA  ML507  Evaluation  Platform  [8]. 

4.6  Virtex-6  FPGA 

The  Virtex-6  LXT  FPGA  ML605  Evaluation  Platform  [9]. 